Commit Graph

1515 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
c448d4e4c0 Merge 7caee37c46 ("net: usb: usbnet: fix name regression") into android15-6.6-lts
Steps on the way to 6.6.59

Change-Id: I94b0e92b46abae9507c7f5eb877b48148cf48108
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-12-05 11:26:42 +00:00
Eric Dumazet
a7bdb19978 net: fix races in netdev_tx_sent_queue()/dev_watchdog()
[ Upstream commit 95ecba62e2 ]

Some workloads hit the infamous dev_watchdog() message:

"NETDEV WATCHDOG: eth0 (xxxx): transmit queue XX timed out"

It seems possible to hit this even for perfectly normal
BQL enabled drivers:

1) Assume a TX queue was idle for more than dev->watchdog_timeo
   (5 seconds unless changed by the driver)

2) Assume a big packet is sent, exceeding current BQL limit.

3) Driver ndo_start_xmit() puts the packet in TX ring,
   and netdev_tx_sent_queue() is called.

4) QUEUE_STATE_STACK_XOFF could be set from netdev_tx_sent_queue()
   before txq->trans_start has been written.

5) txq->trans_start is written later, from netdev_start_xmit()

    if (rc == NETDEV_TX_OK)
          txq_trans_update(txq)

dev_watchdog() running on another cpu could read the old
txq->trans_start, and then see QUEUE_STATE_STACK_XOFF, because 5)
did not happen yet.

To solve the issue, write txq->trans_start right before one XOFF bit
is set:

- __QUEUE_STATE_DRV_XOFF from netif_tx_stop_queue()
- __QUEUE_STATE_STACK_XOFF from netdev_tx_sent_queue()

From dev_watchdog(), we have to read txq->state before txq->trans_start.

Add memory barriers to enforce correct ordering.
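
The ordering described above can be sketched in userspace with C11 atomics.
This is a minimal model under stated assumptions, not the kernel patch:
struct txq_model, MODEL_XOFF, model_stop_queue() and
model_watchdog_timed_out() are hypothetical names, and release/acquire
atomics stand in for the kernel's memory barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a TX queue; not struct netdev_queue. */
struct txq_model {
	atomic_ulong trans_start;	/* time of last transmit, in ticks */
	atomic_ulong state;		/* bit 0 models one XOFF bit */
};

#define MODEL_XOFF 1UL

/* Writer side: publish the timestamp first, then set the XOFF bit. */
static void model_stop_queue(struct txq_model *q, unsigned long now)
{
	assert(q);
	atomic_store_explicit(&q->trans_start, now, memory_order_relaxed);
	/* Release: the timestamp store cannot be reordered after this. */
	atomic_fetch_or_explicit(&q->state, MODEL_XOFF, memory_order_release);
}

/* Reader side: read the state first, then the timestamp. */
static bool model_watchdog_timed_out(struct txq_model *q, unsigned long now,
				     unsigned long timeo)
{
	unsigned long state, start;

	assert(q);
	/* Acquire: pairs with the release in model_stop_queue(). */
	state = atomic_load_explicit(&q->state, memory_order_acquire);
	if (!(state & MODEL_XOFF))
		return false;

	start = atomic_load_explicit(&q->trans_start, memory_order_relaxed);
	/* Wraparound-safe "now is after start + timeo" comparison. */
	return (long)((start + timeo) - now) < 0;
}
```

In this model, a reader that observes the XOFF bit is guaranteed to also
observe the timestamp written just before it. If the bit were set while
trans_start still held a value older than the timeout, the same reader
would report a timeout immediately, which is the spurious case the commit
describes.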

In the future, we could avoid writing over txq->trans_start for normal
operations, and rename this field to txq->xoff_start_time.

Fixes: bec251bc8b ("net: no longer stop all TX queues in dev_watchdog()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241015194118.3951657-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-01 01:58:29 +01:00
Greg Kroah-Hartman
ba4a8a450d This is the 6.6.55 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmcHpVcACgkQONu9yGCS
 aT5ruxAAiv7pOBKTn4uk6sJ67GjSh7tpjRdwHjekAmxnOX2JdtSDIfaTN1V0pAW9
 8xhbymu9b89iYov+XOV7Gia/1PlR0ZiLjrbKEowCL7cu1pXP1y+iDuAIFdGoe4l8
 3Ajy35xL1xNbHgsuazsz6xAeBSfHYHAXZ8rYdyq37ZTTeDz90U7MVK+eMV7WbYtX
 /xuWFdaGZR79WPV5/TTd5Psw0pMrfrYysn+p6HhOgzWvNjCqcODUMI1leRqK4GT9
 GVEIoCvdsuz3f4C/to1pKgzW61+1oHVlCpdi7Uw6BnCAfILK5ez3b9ISnvY0fQ4Z
 kmhUPA9ZHCYjtBR0E2+KaCmZvZBf/TIP63pd+aD2PRvkMxNgY93ZxyzuJRSVoxMx
 y/vsIXaC0fCO1dGsBz0MvS+gLleLe8DWheNOWIdHxdWBBm3gapzq+PxO4nuVVzZu
 iQGvXyrdLah1ZxnsDa2AE+++97sOJfHVRWxTsouaGa0pGMdr0GUpOp0F3Z3uG3fE
 ATxJ9tSNj9ozZ+3tImoZxNJuyjN9IGPkxqFQLplJpzky0uGlOFBuR9ZMPYGVrMiG
 /GfZfO8TCULo5+Hy8leEuwPu9vCfqk4k96EB5p0zv0tTuuZXyf3NDt4UzQbd/q7R
 RoHZIDgpO+XvUoyMhOkntkT/1MsU04zzSm767IgGJEl44/NEu5Y=
 =YeIl
 -----END PGP SIGNATURE-----

Merge 6.6.55 into android15-6.6-lts

Changes in 6.6.55
	static_call: Handle module init failure correctly in static_call_del_module()
	static_call: Replace pointless WARN_ON() in static_call_module_notify()
	jump_label: Simplify and clarify static_key_fast_inc_cpus_locked()
	jump_label: Fix static_key_slow_dec() yet again
	scsi: st: Fix input/output error on empty drive reset
	scsi: pm8001: Do not overwrite PCI queue mapping
	drm/amdgpu: Fix get each xcp macro
	mailbox: rockchip: fix a typo in module autoloading
	mailbox: bcm2835: Fix timeout during suspend mode
	ceph: remove the incorrect Fw reference check when dirtying pages
	ieee802154: Fix build error
	net: sparx5: Fix invalid timestamps
	net/mlx5: Fix error path in multi-packet WQE transmit
	net/mlx5: Added cond_resched() to crdump collection
	net/mlx5e: Fix NULL deref in mlx5e_tir_builder_alloc()
	net/mlx5e: Fix crash caused by calling __xfrm_state_delete() twice
	netfilter: uapi: NFTA_FLOWTABLE_HOOK is NLA_NESTED
	net: ieee802154: mcr20a: Use IRQF_NO_AUTOEN flag in request_irq()
	net: wwan: qcom_bam_dmux: Fix missing pm_runtime_disable()
	selftests: netfilter: Fix nft_audit.sh for newer nft binaries
	netfilter: nf_tables: prevent nf_skb_duplicated corruption
	Bluetooth: MGMT: Fix possible crash on mgmt_index_removed
	Bluetooth: L2CAP: Fix uaf in l2cap_connect
	Bluetooth: btmrvl: Use IRQF_NO_AUTOEN flag in request_irq()
	net: Add netif_get_gro_max_size helper for GRO
	net: Fix gso_features_check to check for both dev->gso_{ipv4_,}max_size
	net: ethernet: lantiq_etop: fix memory disclosure
	net: fec: Restart PPS after link state change
	net: fec: Reload PTP registers after link-state change
	net: avoid potential underflow in qdisc_pkt_len_init() with UFO
	net: add more sanity checks to qdisc_pkt_len_init()
	net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check
	ipv4: ip_gre: Fix drops of small packets in ipgre_xmit
	net: test for not too small csum_start in virtio_net_hdr_to_skb()
	ppp: do not assume bh is held in ppp_channel_bridge_input()
	iomap: constrain the file range passed to iomap_file_unshare
	dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems
	sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
	i2c: xiic: improve error message when transfer fails to start
	i2c: xiic: Try re-initialization on bus busy timeout
	loop: don't set QUEUE_FLAG_NOMERGES
	Bluetooth: hci_sock: Fix not validating setsockopt user input
	media: usbtv: Remove useless locks in usbtv_video_free()
	Bluetooth: ISO: Fix not validating setsockopt user input
	Bluetooth: L2CAP: Fix not validating setsockopt user input
	ASoC: atmel: mchp-pdmc: Skip ALSA restoration if substream runtime is uninitialized
	ALSA: mixer_oss: Remove some incorrect kfree_const() usages
	ALSA: hda/realtek: Fix the push button function for the ALC257
	cifs: Remove intermediate object of failed create reparse call
	ALSA: hda/generic: Unconditionally prefer preferred_dacs pairs
	ASoC: imx-card: Set card.owner to avoid a warning calltrace if SND=m
	cifs: Fix buffer overflow when parsing NFS reparse points
	cifs: Do not convert delimiter when parsing NFS-style symlinks
	ALSA: gus: Fix some error handling paths related to get_bpos() usage
	ALSA: hda/conexant: Fix conflicting quirk for System76 Pangolin
	wifi: ath9k: fix possible integer overflow in ath9k_get_et_stats()
	wifi: rtw89: avoid to add interface to list twice when SER
	wifi: ath9k_htc: Use __skb_set_length() for resetting urb before resubmit
	crypto: x86/sha256 - Add parentheses around macros' single arguments
	crypto: octeontx - Fix authenc setkey
	crypto: octeontx2 - Fix authenc setkey
	ice: Adjust over allocation of memory in ice_sched_add_root_node() and ice_sched_add_node()
	wifi: iwlwifi: mvm: Fix a race in scan abort flow
	wifi: iwlwifi: mvm: drop wrong STA selection in TX
	wifi: cfg80211: Set correct chandef when starting CAC
	net/xen-netback: prevent UAF in xenvif_flush_hash()
	net: hisilicon: hip04: fix OF node leak in probe()
	net: hisilicon: hns_dsaf_mac: fix OF node leak in hns_mac_get_info()
	net: hisilicon: hns_mdio: fix OF node leak in probe()
	ACPI: PAD: fix crash in exit_round_robin()
	ACPICA: Fix memory leak if acpi_ps_get_next_namepath() fails
	ACPICA: Fix memory leak if acpi_ps_get_next_field() fails
	e1000e: avoid failing the system during pm_suspend
	wifi: mt76: mt7915: disable tx worker during tx BA session enable/disable
	net: sched: consistently use rcu_replace_pointer() in taprio_change()
	Bluetooth: btusb: Add Realtek RTL8852C support ID 0x0489:0xe122
	Bluetooth: btrtl: Set msft ext address filter quirk for RTL8852B
	ACPI: video: Add force_vendor quirk for Panasonic Toughbook CF-18
	ACPI: CPPC: Add support for setting EPP register in FFH
	blk_iocost: fix more out of bound shifts
	wifi: ath12k: fix array out-of-bound access in SoC stats
	wifi: ath11k: fix array out-of-bound access in SoC stats
	wifi: rtw88: select WANT_DEV_COREDUMP
	ACPI: EC: Do not release locks during operation region accesses
	ACPICA: check null return of ACPI_ALLOCATE_ZEROED() in acpi_db_convert_to_package()
	tipc: guard against string buffer overrun
	net: mvpp2: Increase size of queue_name buffer
	bnxt_en: Extend maximum length of version string by 1 byte
	ipv4: Check !in_dev earlier for ioctl(SIOCSIFADDR).
	wifi: rtw89: correct base HT rate mask for firmware
	ipv4: Mask upper DSCP bits and ECN bits in NETLINK_FIB_LOOKUP family
	net: atlantic: Avoid warning about potential string truncation
	crypto: simd - Do not call crypto_alloc_tfm during registration
	netpoll: Ensure clean state on setup failures
	tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() process
	wifi: iwlwifi: mvm: use correct key iteration
	wifi: iwlwifi: mvm: avoid NULL pointer dereference
	wifi: mac80211: fix RCU list iterations
	ACPICA: iasl: handle empty connection_node
	proc: add config & param to block forcing mem writes
	drivers/perf: arm_spe: Use perf_allow_kernel() for permissions
	can: netlink: avoid call to do_set_data_bittiming callback with stale can_priv::ctrlmode
	wifi: mt76: mt7915: add dummy HW offload of IEEE 802.11 fragmentation
	wifi: mt76: mt7915: hold dev->mt76.mutex while disabling tx worker
	wifi: mwifiex: Fix memcpy() field-spanning write warning in mwifiex_cmd_802_11_scan_ext()
	nfp: Use IRQF_NO_AUTOEN flag in request_irq()
	ALSA: usb-audio: Add input value sanity checks for standard types
	x86/ioapic: Handle allocation failures gracefully
	ALSA: usb-audio: Define macros for quirk table entries
	ALSA: usb-audio: Replace complex quirk lines with macros
	ALSA: usb-audio: Add logitech Audio profile quirk
	ASoC: codecs: wsa883x: Handle reading version failure
	tools/x86/kcpuid: Protect against faulty "max subleaf" values
	x86/pkeys: Add PKRU as a parameter in signal handling functions
	x86/pkeys: Restore altstack access in sigreturn()
	x86/kexec: Add EFI config table identity mapping for kexec kernel
	ALSA: asihpi: Fix potential OOB array access
	ALSA: hdsp: Break infinite MIDI input flush loop
	tools/nolibc: powerpc: limit stack-protector workaround to GCC
	selftests/nolibc: avoid passing NULL to printf("%s")
	x86/syscall: Avoid memcpy() for ia32 syscall_get_arguments()
	hwmon: (nct6775) add G15CF to ASUS WMI monitoring list
	fbdev: efifb: Register sysfs groups through driver core
	fbdev: pxafb: Fix possible use after free in pxafb_task()
	rcuscale: Provide clear error when async specified without primitives
	power: reset: brcmstb: Do not go into infinite loop if reset fails
	iommu/vt-d: Always reserve a domain ID for identity setup
	iommu/vt-d: Fix potential lockup if qi_submit_sync called with 0 count
	drm/stm: Avoid use-after-free issues with crtc and plane
	drm/amdgpu: disallow multiple BO_HANDLES chunks in one submit
	drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointer
	drm/amd/display: Add null check for top_pipe_to_program in commit_planes_for_stream
	ata: pata_serverworks: Do not use the term blacklist
	ata: sata_sil: Rename sil_blacklist to sil_quirks
	HID: Ignore battery for all ELAN I2C-HID devices
	drm/amd/display: Handle null 'stream_status' in 'planes_changed_for_existing_stream'
	drm/amd/display: Check null pointers before using dc->clk_mgr
	drm/amd/display: Add null check for 'afb' in amdgpu_dm_plane_handle_cursor_update (v2)
	drm/amd/display: fix double free issue during amdgpu module unload
	jfs: UBSAN: shift-out-of-bounds in dbFindBits
	jfs: Fix uaf in dbFreeBits
	jfs: check if leafidx greater than num leaves per dmap tree
	scsi: smartpqi: correct stream detection
	drm/msm/adreno: Assign msm_gpu->pdev earlier to avoid nullptrs
	jfs: Fix uninit-value access of new_ea in ea_buffer
	drm/amdgpu: add raven1 gfxoff quirk
	drm/amdgpu: enable gfxoff quirk on HP 705G4
	drm/amdkfd: Fix resource leak in criu restore queue
	HID: multitouch: Add support for Thinkpad X12 Gen 2 Kbd Portfolio
	platform/x86: touchscreen_dmi: add nanote-next quirk
	drm/stm: ltdc: reset plane transparency after plane disable
	drm/amd/display: Check stream before comparing them
	drm/amd/display: Check link_res->hpo_dp_link_enc before using it
	drm/amd/display: Fix index out of bounds in DCN30 degamma hardware format translation
	drm/amd/display: Fix index out of bounds in degamma hardware format translation
	drm/amd/display: Fix index out of bounds in DCN30 color transformation
	drm/amd/display: Avoid overflow assignment in link_dp_cts
	drm/amd/display: Initialize get_bytes_per_element's default to 1
	drm/printer: Allow NULL data in devcoredump printer
	perf,x86: avoid missing caller address in stack traces captured in uprobe
	scsi: aacraid: Rearrange order of struct aac_srb_unit
	scsi: lpfc: Update PRLO handling in direct attached topology
	drm/amdgpu: fix unchecked return value warning for amdgpu_gfx
	perf: Fix event_function_call() locking
	scsi: NCR5380: Initialize buffer for MSG IN and STATUS transfers
	drm/radeon/r100: Handle unknown family in r100_cp_init_microcode()
	drm/amdgpu: Block MMR_READ IOCTL in reset
	drm/amdgpu/gfx9: use rlc safe mode for soft recovery
	drm/amd/pm: ensure the fw_info is not null before using it
	of/irq: Refer to actual buffer size in of_irq_parse_one()
	powerpc/pseries: Use correct data types from pseries_hp_errorlog struct
	drm/amdgpu/gfx11: use rlc safe mode for soft recovery
	drm/amdgpu/gfx10: use rlc safe mode for soft recovery
	platform/x86: lenovo-ymc: Ignore the 0x0 state
	ksmbd: add refcnt to ksmbd_conn struct
	bpf: Make the pointer returned by iter next method valid
	ext4: ext4_search_dir should return a proper error
	ext4: avoid use-after-free in ext4_ext_show_leaf()
	ext4: fix i_data_sem unlock order in ext4_ind_migrate()
	bpftool: Fix undefined behavior caused by shifting into the sign bit
	iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
	bpftool: Fix undefined behavior in qsort(NULL, 0, ...)
	spi: spi-imx: Fix pm_runtime_set_suspended() with runtime pm enabled
	spi: spi-cadence: Use helper function devm_clk_get_enabled()
	spi: spi-cadence: Fix pm_runtime_set_suspended() with runtime pm enabled
	spi: spi-cadence: Fix missing spi_controller_is_target() check
	selftest: hid: add missing run-hid-tools-tests.sh
	spi: s3c64xx: fix timeout counters in flush_fifo
	selftests: breakpoints: use remaining time to check if suspend succeed
	accel/ivpu: Add missing MODULE_FIRMWARE metadata
	spi: rpc-if: Add missing MODULE_DEVICE_TABLE
	perf callchain: Fix stitch LBR memory leaks
	perf: Really fix event_function_call() locking
	selftests: vDSO: fix vDSO name for powerpc
	selftests: vDSO: fix vdso_config for powerpc
	selftests: vDSO: fix vDSO symbols lookup for powerpc64
	selftests/mm: fix charge_reserved_hugetlb.sh test
	powerpc/vdso: Fix VDSO data access when running in a non-root time namespace
	selftests: vDSO: fix ELF hash table entry size for s390x
	selftests: vDSO: fix vdso_config for s390
	Revert "ALSA: hda: Conditionally use snooping for AMD HDMI"
	platform/x86: ISST: Fix the KASAN report slab-out-of-bounds bug
	i2c: stm32f7: Do not prepare/unprepare clock during runtime suspend/resume
	i2c: qcom-geni: Use IRQF_NO_AUTOEN flag in request_irq()
	i2c: xiic: Wait for TX empty to avoid missed TX NAKs
	media: i2c: ar0521: Use cansleep version of gpiod_set_value()
	i2c: xiic: Fix pm_runtime_set_suspended() with runtime pm enabled
	i2c: designware: fix controller is holding SCL low while ENABLE bit is disabled
	rust: sync: require `T: Sync` for `LockedBy::access`
	ovl: fail if trusted xattrs are needed but caller lacks permission
	firmware: tegra: bpmp: Drop unused mbox_client_to_bpmp()
	memory: tegra186-emc: drop unused to_tegra186_emc()
	dt-bindings: clock: exynos7885: Fix duplicated binding
	spi: bcm63xx: Fix module autoloading
	spi: bcm63xx: Fix missing pm_runtime_disable()
	power: supply: hwmon: Fix missing temp1_max_alarm attribute
	perf/core: Fix small negative period being ignored
	parisc: Fix itlb miss handler for 64-bit programs
	drm/mediatek: ovl_adaptor: Add missing of_node_put()
	drm: Consistently use struct drm_mode_rect for FB_DAMAGE_CLIPS
	ALSA: hda/tas2781: Add new quirk for Lenovo Y990 Laptop
	ALSA: core: add isascii() check to card ID generator
	ALSA: usb-audio: Add delay quirk for VIVO USB-C HEADSET
	ALSA: usb-audio: Add native DSD support for Luxman D-08u
	ALSA: line6: add hw monitor volume control to POD HD500X
	ALSA: hda/realtek: Add quirk for Huawei MateBook 13 KLV-WX9
	ALSA: hda/realtek: Add a quirk for HP Pavilion 15z-ec200
	ext4: no need to continue when the number of entries is 1
	ext4: correct encrypted dentry name hash when not casefolded
	ext4: fix slab-use-after-free in ext4_split_extent_at()
	ext4: propagate errors from ext4_find_extent() in ext4_insert_range()
	ext4: fix incorrect tid assumption in ext4_fc_mark_ineligible()
	ext4: dax: fix overflowing extents beyond inode size when partially writing
	ext4: fix incorrect tid assumption in __jbd2_log_wait_for_space()
	ext4: drop ppath from ext4_ext_replay_update_ex() to avoid double-free
	ext4: aovid use-after-free in ext4_ext_insert_extent()
	ext4: fix double brelse() the buffer of the extents path
	ext4: fix timer use-after-free on failed mount
	ext4: update orig_path in ext4_find_extent()
	ext4: fix incorrect tid assumption in ext4_wait_for_tail_page_commit()
	ext4: fix incorrect tid assumption in jbd2_journal_shrink_checkpoint_list()
	ext4: fix fast commit inode enqueueing during a full journal commit
	ext4: use handle to mark fc as ineligible in __track_dentry_update()
	ext4: mark fc as ineligible using an handle in ext4_xattr_set()
	parisc: Fix 64-bit userspace syscall path
	parisc: Allow mmap(MAP_STACK) memory to automatically expand upwards
	parisc: Fix stack start for ADDR_NO_RANDOMIZE personality
	drm/rockchip: vop: clear DMA stop bit on RK3066
	of: address: Report error on resource bounds overflow
	of/irq: Support #msi-cells=<0> in of_msi_get_domain
	drm: omapdrm: Add missing check for alloc_ordered_workqueue
	resource: fix region_intersects() vs add_memory_driver_managed()
	jbd2: stop waiting for space when jbd2_cleanup_journal_tail() returns error
	jbd2: correctly compare tids with tid_geq function in jbd2_fc_begin_commit
	mm: krealloc: consider spare memory for __GFP_ZERO
	ocfs2: fix the la space leak when unmounting an ocfs2 volume
	ocfs2: fix uninit-value in ocfs2_get_block()
	ocfs2: reserve space for inline xattr before attaching reflink tree
	ocfs2: cancel dqi_sync_work before freeing oinfo
	ocfs2: remove unreasonable unlock in ocfs2_read_blocks
	ocfs2: fix null-ptr-deref when journal load failed.
	ocfs2: fix possible null-ptr-deref in ocfs2_set_buffer_uptodate
	arm64: fix selection of HAVE_DYNAMIC_FTRACE_WITH_ARGS
	arm64: Subscribe Microsoft Azure Cobalt 100 to erratum 3194386
	riscv: define ILLEGAL_POINTER_VALUE for 64bit
	exfat: fix memory leak in exfat_load_bitmap()
	perf python: Disable -Wno-cast-function-type-mismatch if present on clang
	perf hist: Update hist symbol when updating maps
	nfsd: fix delegation_blocked() to block correctly for at least 30 seconds
	nfsd: map the EBADMSG to nfserr_io to avoid warning
	NFSD: Fix NFSv4's PUTPUBFH operation
	i3c: master: svc: Fix use after free vulnerability in svc_i3c_master Driver Due to Race Condition
	RDMA/mana_ib: use the correct page size for mapping user-mode doorbell page
	riscv: Fix kernel stack size when KASAN is enabled
	aoe: fix the potential use-after-free problem in more places
	media: ov5675: Fix power on/off delay timings
	clk: rockchip: fix error for unknown clocks
	remoteproc: k3-r5: Fix error handling when power-up failed
	clk: qcom: dispcc-sm8250: use CLK_SET_RATE_PARENT for branch clocks
	media: sun4i_csi: Implement link validate for sun4i_csi subdev
	clk: qcom: gcc-sm8450: Do not turn off PCIe GDSCs during gdsc_disable()
	media: uapi/linux/cec.h: cec_msg_set_reply_to: zero flags
	clk: qcom: clk-rpmh: Fix overflow in BCM vote
	clk: samsung: exynos7885: Update CLKS_NR_FSYS after bindings fix
	clk: qcom: gcc-sm8150: De-register gcc_cpuss_ahb_clk_src
	media: venus: fix use after free bug in venus_remove due to race condition
	clk: qcom: gcc-sm8250: Do not turn off PCIe GDSCs during gdsc_disable()
	media: qcom: camss: Remove use_count guard in stop_streaming
	media: qcom: camss: Fix ordering of pm_runtime_enable
	clk: qcom: gcc-sc8180x: Fix the sdcc2 and sdcc4 clocks freq table
	clk: qcom: clk-alpha-pll: Fix CAL_L_VAL override for LUCID EVO PLL
	smb: client: use actual path when queryfs
	smb3: fix incorrect mode displayed for read-only files
	iio: magnetometer: ak8975: Fix reading for ak099xx sensors
	vrf: revert "vrf: Remove unnecessary RCU-bh critical section"
	gso: fix udp gso fraglist segmentation after pull from frag_list
	tomoyo: fallback to realpath if symlink's pathname does not exist
	net: stmmac: Fix zero-division error when disabling tc cbs
	rtc: at91sam9: fix OF node leak in probe() error path
	Input: adp5589-keys - fix NULL pointer dereference
	Input: adp5589-keys - fix adp5589_gpio_get_value()
	cachefiles: fix dentry leak in cachefiles_open_file()
	ACPI: resource: Add Asus Vivobook X1704VAP to irq1_level_low_skip_override[]
	ACPI: resource: Add Asus ExpertBook B2502CVA to irq1_level_low_skip_override[]
	btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
	btrfs: send: fix invalid clone operation for file that got its size decreased
	btrfs: wait for fixup workers before stopping cleaner kthread during umount
	cpufreq: Avoid a bad reference count on CPU node
	gpio: davinci: fix lazy disable
	net: pcs: xpcs: fix the wrong register that was written back
	Bluetooth: hci_event: Align BR/EDR JUST_WORKS paring with LE
	mac802154: Fix potential RCU dereference issue in mac802154_scan_worker
	ceph: fix cap ref leak via netfs init_request
	tracing/hwlat: Fix a race during cpuhp processing
	tracing/timerlat: Drop interface_lock in stop_kthread()
	tracing/timerlat: Fix a race during cpuhp processing
	tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline
	rtla: Fix the help text in osnoise and timerlat top tools
	close_range(): fix the logics in descriptor table trimming
	drm/i915/gem: fix bitwise and logical AND mixup
	drm/sched: Add locking to drm_sched_entity_modify_sched
	drm/amd/display: Add HDR workaround for specific eDP
	drm/amd/display: Fix system hang while resume with TBT monitor
	cpufreq: intel_pstate: Make hwp_notify_lock a raw spinlock
	kconfig: qconf: fix buffer overflow in debug links
	platform/x86: x86-android-tablets: Create a platform_device from module_init()
	platform/x86: x86-android-tablets: Fix use after free on platform_device_register() errors
	i2c: create debugfs entry per adapter
	i2c: core: Lock address during client device instantiation
	i2c: synquacer: Remove a clk reference from struct synquacer_i2c
	i2c: synquacer: Deal with optional PCLK correctly
	arm64: cputype: Add Neoverse-N3 definitions
	arm64: errata: Expand speculative SSBS workaround once more
	io_uring/net: harden multishot termination case for recv
	uprobes: fix kernel info leak via "[uprobes]" vma
	mm: z3fold: deprecate CONFIG_Z3FOLD
	drm/amd/display: Allow backlight to go below `AMDGPU_DM_DEFAULT_MIN_BACKLIGHT`
	build-id: require program headers to be right after ELF header
	lib/buildid: harden build ID parsing logic
	sched: psi: fix bogus pressure spikes from aggregation race
	net: mana: Enable MANA driver on ARM64 with 4K page size
	net: mana: Add support for page sizes other than 4KB on ARM64
	RDMA/mana_ib: use the correct page table index based on hardware page size
	media: i2c: imx335: Enable regulator supplies
	media: imx335: Fix reset-gpio handling
	remoteproc: k3-r5: Acquire mailbox handle during probe routine
	remoteproc: k3-r5: Delay notification of wakeup event
	dt-bindings: clock: qcom: Add missing UFS QREF clocks
	dt-bindings: clock: qcom: Add GPLL9 support on gcc-sc8180x
	iio: pressure: bmp280: Allow multiple chips id per family of devices
	iio: pressure: bmp280: Improve indentation and line wrapping
	iio: pressure: bmp280: Use BME prefix for BME280 specifics
	iio: pressure: bmp280: Fix regmap for BMP280 device
	iio: pressure: bmp280: Fix waiting time for BMP3xx configuration
	r8169: Fix spelling mistake: "tx_underun" -> "tx_underrun"
	r8169: add tally counter fields added with RTL8125
	clk: qcom: gcc-sc8180x: Add GPLL9 support
	ACPI: battery: Simplify battery hook locking
	ACPI: battery: Fix possible crash when unregistering a battery hook
	btrfs: relocation: return bool from btrfs_should_ignore_reloc_root
	btrfs: relocation: constify parameters where possible
	btrfs: drop the backref cache during relocation if we commit
	drm/rockchip: vop: enable VOP_FEATURE_INTERNAL_RGB on RK3066
	Revert "drm/amd/display: Skip Recompute DSC Params if no Stream on Link"
	ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path
	netfilter: nf_tables: fix memleak in map from abort path
	netfilter: nf_tables: restore set elements when delete set fails
	net: dsa: fix netdev_priv() dereference before check on non-DSA netdevice events
	iommufd: Fix protection fault in iommufd_test_syz_conv_iova
	drm/bridge: adv7511: fix crash on irq during probe
	efi/unaccepted: touch soft lockup during memory accept
	platform/x86: think-lmi: Fix password opcode ordering for workstations
	null_blk: Remove usage of the deprecated ida_simple_xx() API
	null_blk: fix null-ptr-dereference while configuring 'power' and 'submit_queues'
	net: stmmac: move the EST lock to struct stmmac_priv
	rxrpc: Fix a race between socket set up and I/O thread creation
	vhost/scsi: null-ptr-dereference in vhost_scsi_get_req()
	crypto: octeontx* - Select CRYPTO_AUTHENC
	drm/amd/display: Revert Avoid overflow assignment
	perf report: Fix segfault when 'sym' sort key is not used
	drm/amd/display: enable_hpo_dp_link_output: Check link_res->hpo_dp_link_enc before using it
	null_blk: Fix return value of nullb_device_power_store()
	Revert "ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path"
	perf python: Allow checking for the existence of warning options in clang
	Linux 6.6.55

Applicable to GKI build:
  33f3e83227 jump_label: Simplify and clarify static_key_fast_inc_cpus_locked() [1 file, +11/-9]
  86fdd18064 jump_label: Fix static_key_slow_dec() yet again [1 file, +27/-7]
  8691a82abf netfilter: uapi: NFTA_FLOWTABLE_HOOK is NLA_NESTED [1 file, +1/-1]
  4e3542f40f netfilter: nf_tables: prevent nf_skb_duplicated corruption [2 files, +10/-4]
  4883296505 Bluetooth: MGMT: Fix possible crash on mgmt_index_removed [1 file, +14/-9]
  b90907696c Bluetooth: L2CAP: Fix uaf in l2cap_connect [3 files, +3/-9]
  dae9b99bd2 net: Add netif_get_gro_max_size helper for GRO [2 files, +11/-7]
  718b663403 net: Fix gso_features_check to check for both dev->gso_{ipv4_,}max_size [2 files, +10/-1]
  25ab0b87db net: avoid potential underflow in qdisc_pkt_len_init() with UFO [1 file, +1/-1]
  9b0ee571d2 net: add more sanity checks to qdisc_pkt_len_init() [1 file, +7/-3]
  ea8cad4ca5 ipv4: ip_gre: Fix drops of small packets in ipgre_xmit [1 file, +3/-3]
  d9dfd41e32 net: test for not too small csum_start in virtio_net_hdr_to_skb() [1 file, +3/-1]
  f9620e2a66 ppp: do not assume bh is held in ppp_channel_bridge_input() [1 file, +2/-2]
  b66ff9a3fc loop: don't set QUEUE_FLAG_NOMERGES [1 file, +2/-13]
  0c18a64039 Bluetooth: hci_sock: Fix not validating setsockopt user input [1 file, +8/-13]
  6a6baa1ee7 Bluetooth: ISO: Fix not validating setsockopt user input [1 file, +12/-24]
  28234f8ab6 Bluetooth: L2CAP: Fix not validating setsockopt user input [1 file, +20/-32]
  1ab2cfe197 blk_iocost: fix more out of bound shifts [1 file, +5/-3]
  12d26aa7fd tipc: guard against string buffer overrun [1 file, +6/-2]
  d4c4653b60 ipv4: Check !in_dev earlier for ioctl(SIOCSIFADDR). [1 file, +2/-4]
  f989162f55 ipv4: Mask upper DSCP bits and ECN bits in NETLINK_FIB_LOOKUP family [1 file, +1/-1]
  5cce1c07bf tcp: avoid reusing FIN_WAIT2 when trying to find port in connect() process [1 file, +3/-0]
  b4f8240bc3 can: netlink: avoid call to do_set_data_bittiming callback with stale can_priv::ctrlmode [1 file, +51/-51]
  864f68a242 ALSA: usb-audio: Add input value sanity checks for standard types [2 files, +28/-8]
  70d5e30b0a ALSA: usb-audio: Add logitech Audio profile quirk [1 file, +6/-0]
  4ee08b4a72 drm/printer: Allow NULL data in devcoredump printer [2 files, +61/-6]
  66a403d89b perf: Fix event_function_call() locking [1 file, +5/-4]
  fe2c86e192 of/irq: Refer to actual buffer size in of_irq_parse_one() [1 file, +2/-2]
  b111ae42bb bpf: Make the pointer returned by iter next method valid [1 file, +22/-4]
  1fe2852720 ext4: ext4_search_dir should return a proper error [1 file, +7/-5]
  34b2096380 ext4: avoid use-after-free in ext4_ext_show_leaf() [1 file, +4/-5]
  d43776b907 ext4: fix i_data_sem unlock order in ext4_ind_migrate() [1 file, +1/-1]
  390b9e54cd iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release [1 file, +9/-1]
  9629c0c3e8 perf: Really fix event_function_call() locking [1 file, +8/-5]
  bf47be5479 ovl: fail if trusted xattrs are needed but caller lacks permission [1 file, +33/-5]
  028258156f firmware: tegra: bpmp: Drop unused mbox_client_to_bpmp() [1 file, +0/-6]
  ff580d0130 memory: tegra186-emc: drop unused to_tegra186_emc() [1 file, +0/-5]
  9fca08c06a perf/core: Fix small negative period being ignored [1 file, +5/-1]
  c923bc8746 drm: Consistently use struct drm_mode_rect for FB_DAMAGE_CLIPS [1 file, +1/-1]
  aba1be9a80 ALSA: core: add isascii() check to card ID generator [1 file, +10/-4]
  9d125aab4c ALSA: usb-audio: Add delay quirk for VIVO USB-C HEADSET [1 file, +2/-0]
  228a8b952c ALSA: usb-audio: Add native DSD support for Luxman D-08u [1 file, +2/-0]
  2d64e7dada ext4: no need to continue when the number of entries is 1 [1 file, +1/-1]
  a56e5f389d ext4: correct encrypted dentry name hash when not casefolded [1 file, +11/-3]
  8fe117790b ext4: fix slab-use-after-free in ext4_split_extent_at() [1 file, +20/-1]
  f4308d8ee3 ext4: propagate errors from ext4_find_extent() in ext4_insert_range() [1 file, +1/-0]
  8c762b4e19 ext4: fix incorrect tid assumption in ext4_fc_mark_ineligible() [1 file, +11/-4]
  5efccdee4a ext4: dax: fix overflowing extents beyond inode size when partially writing [1 file, +4/-4]
  93051d16b3 ext4: fix incorrect tid assumption in __jbd2_log_wait_for_space() [1 file, +5/-2]
  1b558006d9 ext4: drop ppath from ext4_ext_replay_update_ex() to avoid double-free [1 file, +10/-11]
  8162ee5d94 ext4: aovid use-after-free in ext4_ext_insert_extent() [1 file, +1/-0]
  68a69cf606 ext4: fix double brelse() the buffer of the extents path [1 file, +1/-0]
  9203817ba4 ext4: fix timer use-after-free on failed mount [1 file, +1/-1]
  f55ecc58d0 ext4: update orig_path in ext4_find_extent() [2 files, +2/-2]
  80dccb81b7 ext4: fix incorrect tid assumption in ext4_wait_for_tail_page_commit() [1 file, +7/-4]
  1552199ace ext4: fix incorrect tid assumption in jbd2_journal_shrink_checkpoint_list() [1 file, +5/-2]
  d13a3558e8 ext4: fix fast commit inode enqueueing during a full journal commit [2 files, +15/-2]
  c5771f1c48 ext4: use handle to mark fc as ineligible in __track_dentry_update() [1 file, +11/-8]
  89bbc55d6b ext4: mark fc as ineligible using an handle in ext4_xattr_set() [1 file, +2/-1]
  a17dfde577 parisc: Fix stack start for ADDR_NO_RANDOMIZE personality [1 file, +2/-1]
  d657d28641 of: address: Report error on resource bounds overflow [1 file, +5/-0]
  0022085f11 of/irq: Support #msi-cells=<0> in of_msi_get_domain [1 file, +7/-27]
  393331e16c resource: fix region_intersects() vs add_memory_driver_managed() [1 file, +50/-8]
  1c62dc0d82 jbd2: stop waiting for space when jbd2_cleanup_journal_tail() returns error [1 file, +5/-2]
  fd34962434 jbd2: correctly compare tids with tid_geq function in jbd2_fc_begin_commit [1 file, +1/-1]
  e3a9fc1520 mm: krealloc: consider spare memory for __GFP_ZERO [1 file, +7/-0]
  bf0b3b3525 exfat: fix memory leak in exfat_load_bitmap() [1 file, +5/-5]
  af3122f5fd gso: fix udp gso fraglist segmentation after pull from frag_list [1 file, +20/-2]
  0f41f383b5 cpufreq: Avoid a bad reference count on CPU node [1 file, +1/-5]
  830c03e58b Bluetooth: hci_event: Align BR/EDR JUST_WORKS paring with LE [1 file, +5/-8]
  e676e4ea76 mac802154: Fix potential RCU dereference issue in mac802154_scan_worker [1 file, +3/-1]
  a8023f8b55 close_range(): fix the logics in descriptor table trimming [3 files, +52/-83]
  4a2be5a728 i2c: create debugfs entry per adapter [2 files, +13/-0]
  316be4911f i2c: core: Lock address during client device instantiation [2 files, +31/-0]
  9a3e9aab60 arm64: cputype: Add Neoverse-N3 definitions [1 file, +2/-0]
  24f7989ed2 io_uring/net: harden multishot termination case for recv [1 file, +3/-1]
  5b981d8335 uprobes: fix kernel info leak via "[uprobes]" vma [1 file, +1/-1]
  f941d77962 build-id: require program headers to be right after ELF header [1 file, +14/-0]
  c83a80d8b8 lib/buildid: harden build ID parsing logic [1 file, +44/-32]
GKI (arm64) relevant 79 out of 385 changes, affecting 92 files +798/-486

Change-Id: I44c3665935ccd89562de4f0c904061025569e953
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-10-10 12:10:42 +00:00
Daniel Borkmann
718b663403 net: Fix gso_features_check to check for both dev->gso_{ipv4_,}max_size
[ Upstream commit e609c959a9 ]

Commit 24ab059d2e ("net: check dev->gso_max_size in gso_features_check()")
added a dev->gso_max_size test to gso_features_check() in order to fall
back to GSO when needed.

This was added as it was noticed that some drivers could misbehave if TSO
packets get too big. However, the check doesn't respect dev->gso_ipv4_max_size
limit. For instance, a device could be configured with BIG TCP for IPv4,
but not IPv6.

Therefore, add a netif_get_gso_max_size() equivalent to netif_get_gro_max_size()
and use the helper to respect both limits before falling back to GSO engine.
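
The helper idea above can be sketched in userspace C. This is an illustration, not the kernel code: the mock_ types and functions are hypothetical stand-ins for struct net_device and struct sk_buff, and the ">=" comparison against the packet length is an assumption about how gso_features_check() applies the limit.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two per-device GSO limits. */
struct mock_gso_dev {
	unsigned int gso_max_size;      /* limit applied to IPv6 traffic in this sketch */
	unsigned int gso_ipv4_max_size; /* limit applied to IPv4 traffic */
};

/* Hypothetical stand-in for the packet fields the check looks at. */
struct mock_gso_pkt {
	bool is_ipv6;     /* models a protocol test on the skb */
	unsigned int len; /* total packet length in bytes */
};

/* Pick the limit that matches the packet's address family. */
static unsigned int mock_get_gso_max_size(const struct mock_gso_dev *dev,
					  const struct mock_gso_pkt *pkt)
{
	return pkt->is_ipv6 ? dev->gso_max_size : dev->gso_ipv4_max_size;
}

/* Returns true when the packet should fall back to the software GSO engine. */
static bool mock_needs_gso_fallback(const struct mock_gso_dev *dev,
				    const struct mock_gso_pkt *pkt)
{
	return pkt->len >= mock_get_gso_max_size(dev, pkt);
}
```

With BIG TCP enabled only for IPv4 (gso_ipv4_max_size larger than gso_max_size), a 100000-byte IPv4 packet stays on the offload path in this sketch, while an IPv6 packet of the same size falls back.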

Fixes: 24ab059d2e ("net: check dev->gso_max_size in gso_features_check()")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240923212242.15669-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-10 11:57:16 +02:00
Daniel Borkmann
dae9b99bd2 net: Add netif_get_gro_max_size helper for GRO
[ Upstream commit e8d4d34df7 ]

Add a small netif_get_gro_max_size() helper which returns the maximum IPv4
or IPv6 GRO size of the netdevice.

We later add a netif_get_gso_max_size() equivalent as well for GSO, so that
these helpers can be used consistently instead of open-coded checks.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240923212242.15669-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Stable-dep-of: e609c959a9 ("net: Fix gso_features_check to check for both dev->gso_{ipv4_,}max_size")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-10-10 11:57:16 +02:00
Ramji Jiyani
4db218b4ff ANDROID: always add the struct wireless_dev * to struct net_device
When Android moved the wifi drivers to be a vendor driver, it disabled
CFG80211 from the build configuration, yet that needs to be enabled in
the vendor module build.  As the struct net_device is defined in the
core kernel image, both builds need to have the same structure size, so
always enable it in the structure and protect any potential vendor
changes from showing up in the CRC checker by making it a void * as far
as it is concerned.
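
A minimal sketch of why this keeps the layout stable, assuming a platform where all object pointers have the same size and alignment (true for the ABIs Linux targets, but not guaranteed by the C standard). The two mock_ structures and their extra fields are hypothetical; only the idea of swapping a typed pointer for a void * comes from the text above.

```c
#include <stddef.h>

struct wireless_dev; /* opaque here: only the vendor build knows the definition */

/* Hypothetical view of the structure in a build with the wireless stack enabled. */
struct mock_net_device_vendor {
	int ifindex;
	struct wireless_dev *ieee80211_ptr;
	long tail;
};

/* Hypothetical view seen by the core kernel build and the CRC checker. */
struct mock_net_device_core {
	int ifindex;
	void *ieee80211_ptr;
	long tail;
};

/* Both views must agree on total size and on where later fields live. */
_Static_assert(sizeof(struct mock_net_device_vendor) ==
	       sizeof(struct mock_net_device_core),
	       "structure size must not depend on the pointer type");
_Static_assert(offsetof(struct mock_net_device_vendor, tail) ==
	       offsetof(struct mock_net_device_core, tail),
	       "fields after the pointer must keep their offsets");
```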

Bug: 274416891
Test: TH

Fixes: c304eddcec ("net: wrap the wireless pointers in struct net_device in an ifdef")
Change-Id: I7c2a10da63b6022abbac78a3a0d48c2fd405f42c
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
(cherry picked from commit 006d1fc450)
Signed-off-by: Ramji Jiyani <ramjiyani@google.com>
2024-04-10 21:58:17 +00:00
Greg Kroah-Hartman
5ecf6178d1 ANDROID: GKI: the "reusachtig" padding sync with android15-6.1
Add the initial set of ABI padding fields in android15-6.6 based on what
is in the android15-6.1 branch.

Bug: 151154716
Change-Id: Icdb394863b2911389bfdced0fd1ea20236ca4ce1
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2024-03-01 15:57:09 +00:00
Daniel Borkmann
6ae7b3fc7a net: Move {l,t,d}stats allocation to core and convert veth & vrf
[ Upstream commit 34d21de99c ]

Move {l,t,d}stats allocation to the core and let netdevs pick the stats
type they need. That way the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc) - all happening in the core.

Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Stable-dep-of: 024ee930cb ("bpf: Fix dev's rx stats for bpf_redirect_peer traffic")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03 07:33:04 +01:00
Daniel Borkmann
95f068b0fd net, vrf: Move dstats structure to core
[ Upstream commit 79e0c5be8c ]

Just move struct pcpu_dstats out of the vrf into the core, and streamline
the field names slightly, so they better align with the {t,l}stats ones.

No functional change otherwise. A conversion of the u64s to u64_stats_t
could be done at a separate point in future. This move is needed as we are
moving the {t,l,d}stats allocation/freeing to the core.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Stable-dep-of: 024ee930cb ("bpf: Fix dev's rx stats for bpf_redirect_peer traffic")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03 07:33:04 +01:00
Eric Dumazet
c8670b7b8d net: add DEV_STATS_READ() helper
[ Upstream commit 0b068c714c ]

Companion of DEV_STATS_INC() & DEV_STATS_ADD().

This is going to be used in the series.

Use it in macsec_get_stats64().
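
For illustration, a userspace analogue of the three helpers built on C11 atomics. The MOCK_ macros, the mock_ structures and the field names are hypothetical; the kernel helpers operate on the atomic counters embedded in struct net_device, which this sketch does not reproduce.

```c
#include <stdatomic.h>

/* Hypothetical per-device counters, updated without a lock. */
struct mock_dev_stats {
	atomic_long tx_errors;
	atomic_long rx_dropped;
};

struct mock_dev {
	struct mock_dev_stats stats;
};

/* Increment, add and read share one naming scheme and one memory order. */
#define MOCK_DEV_STATS_INC(DEV, FIELD) \
	atomic_fetch_add_explicit(&(DEV)->stats.FIELD, 1, memory_order_relaxed)
#define MOCK_DEV_STATS_ADD(DEV, FIELD, VAL) \
	atomic_fetch_add_explicit(&(DEV)->stats.FIELD, (VAL), memory_order_relaxed)
#define MOCK_DEV_STATS_READ(DEV, FIELD) \
	atomic_load_explicit(&(DEV)->stats.FIELD, memory_order_relaxed)

/* Zero all counters before first use. */
static void mock_dev_stats_init(struct mock_dev *dev)
{
	atomic_init(&dev->stats.tx_errors, 0);
	atomic_init(&dev->stats.rx_dropped, 0);
}
```

A stats getter can then read each counter through the READ macro instead of touching the field directly, which keeps readers and writers on the same access primitive.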

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stable-dep-of: ff672b9ffe ("ipvlan: properly track tx_errors")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-20 11:59:04 +01:00
Jakub Kicinski
d07b7b32da pull-request: bpf-next 2023-08-03
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRdM/uy1Ege0+EN1fNar9k/UBDW4wUCZMvevwAKCRBar9k/UBDW
 42Z0AP90hLZ9OmoghYAlALHLl8zqXuHCV8OeFXR5auqG+kkcCwEAx6h99vnh4zgP
 Tngj6Yid60o39/IZXXblhV37HfSiyQ8=
 =/kVE
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2023-08-03

We've added 54 non-merge commits during the last 10 day(s) which contain
a total of 84 files changed, 4026 insertions(+), 562 deletions(-).

The main changes are:

1) Add SO_REUSEPORT support for TC bpf_sk_assign from Lorenz Bauer,
   Daniel Borkmann

2) Support new insns from cpu v4 from Yonghong Song

3) Non-atomically allocate freelist during prefill from YiFei Zhu

4) Support defragmenting IPv(4|6) packets in BPF from Daniel Xu

5) Add tracepoint to xdp attaching failure from Leon Hwang

6) struct netdev_rx_queue and xdp.h reshuffling to reduce
   rebuild time from Jakub Kicinski

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (54 commits)
  net: invert the netdevice.h vs xdp.h dependency
  net: move struct netdev_rx_queue out of netdevice.h
  eth: add missing xdp.h includes in drivers
  selftests/bpf: Add testcase for xdp attaching failure tracepoint
  bpf, xdp: Add tracepoint to xdp attaching failure
  selftests/bpf: fix static assert compilation issue for test_cls_*.c
  bpf: fix bpf_probe_read_kernel prototype mismatch
  riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework
  libbpf: fix typos in Makefile
  tracing: bpf: use struct trace_entry in struct syscall_tp_t
  bpf, devmap: Remove unused dtab field from bpf_dtab_netdev
  bpf, cpumap: Remove unused cmap field from bpf_cpu_map_entry
  netfilter: bpf: Only define get_proto_defrag_hook() if necessary
  bpf: Fix an array-index-out-of-bounds issue in disasm.c
  net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn
  docs/bpf: Fix malformed documentation
  bpf: selftests: Add defrag selftests
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Support not connecting client socket
  netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link
  ...
====================

Link: https://lore.kernel.org/r/20230803174845.825419-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-03 15:34:36 -07:00
Jakub Kicinski
680ee0456a net: invert the netdevice.h vs xdp.h dependency
xdp.h is far more specific and is included in only 67 other
files vs netdevice.h's 1538 include sites.
Make xdp.h include netdevice.h, instead of the other way around.
This decreases the incremental allmodconfig build size when
xdp.h is touched from 5947 to 662 objects.

Move bpf_prog_run_xdp() to xdp.h, seems appropriate and filter.h
is a mega-header in its own right so it's nice to avoid xdp.h
getting included there as well.

The only unfortunate part is that the typedef for xdp_features_t
has to move to netdevice.h, since it's embedded in struct net_device.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-4-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 08:38:07 -07:00
Jakub Kicinski
49e47a5b61 net: move struct netdev_rx_queue out of netdevice.h
struct netdev_rx_queue is touched in only a few places
and having it defined in netdevice.h brings in the dependency
on xdp.h, because struct xdp_rxq_info gets embedded in
struct netdev_rx_queue.

In prep for removal of xdp.h from netdevice.h move all
the netdev_rx_queue stuff to a new header.

We could technically break the new header up to avoid
the sysfs.h include but it's so rarely included it
doesn't seem to be worth it at this point.

Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Link: https://lore.kernel.org/r/20230803010230.1755386-3-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-03 08:38:07 -07:00
Mateusz Kowalski
f11e5bd159 bonding: support balance-alb with openvswitch
Commit d5410ac7b0 ("net:bonding:support balance-alb interface with
vlan to bridge") introduced support for balance-alb mode for
interfaces connected to the Linux bridge by fixing the missing match of
the MAC entry in the FDB. In our testing we discovered that it still does
not work when the bond is connected to an OVS bridge, as shown in the
diagram below:

eth1(mac:eth1_mac)--bond0(balance-alb,mac:eth0_mac)--eth0(mac:eth0_mac)
                         |
                       bond0.150(mac:eth0_mac)
                         |
                       ovs_bridge(ip:bridge_ip,mac:eth0_mac)

This patch fixes it by checking not only whether the device is a bridge
but also whether it is an openvswitch device.

Signed-off-by: Mateusz Kowalski <mko@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/9fe7297c-609e-208b-c77b-3ceef6eb51a4@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-03 10:25:42 +02:00
Vladimir Oltean
fd770e856e net: remove phy_has_hwtstamp() -> phy_mii_ioctl() decision from converted drivers
It is desirable that the new .ndo_hwtstamp_set() API gives more
uniformity, less overhead and future flexibility w.r.t. the PHY
timestamping behavior.

Currently there are some drivers which allow PHY timestamping through
the procedure mentioned in Documentation/networking/timestamping.rst.
They don't do anything locally if phy_has_hwtstamp() is set, except for
lan966x which installs PTP packet traps.

Centralize that behavior in a new dev_set_hwtstamp_phylib() code
function, which calls either phy_mii_ioctl() for the phylib PHY,
or .ndo_hwtstamp_set() of the netdev, based on a single policy
(currently simplistic: phy_has_hwtstamp()).

Any driver converted to .ndo_hwtstamp_set() will automatically opt into
the centralized phylib timestamping policy. Unconverted drivers still
get to choose whether they let the PHY handle timestamping or not.

Netdev drivers with integrated PHY drivers that don't use phylib
presumably don't set dev->phydev, and those will always see
HWTSTAMP_SOURCE_NETDEV requests even when converted. The timestamping
policy will remain 100% up to them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-13-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:11:06 -07:00
Maxim Georgiev
e47d01fea6 net: add hwtstamping helpers for stackable net devices
The stackable net devices with hwtstamping support (vlan, macvlan,
bonding) only pass the hwtstamping ops to the lower (real) device.

These drivers are the first that need to be converted to the new
timestamping API, because if they aren't prepared to handle that,
then no real device driver can be converted to the new API either.

After studying what vlan_dev_ioctl(), macvlan_eth_ioctl() and
bond_eth_ioctl() have in common, here we propose two generic
implementations of ndo_hwtstamp_get() and ndo_hwtstamp_set() which
can be called by those 3 drivers, with "dev" being their lower device.

These helpers cover both cases, when the lower driver is converted to
the new API or unconverted.

We need some hacks in case of an unconverted driver, namely to stuff
some pointers in struct kernel_hwtstamp_config which shouldn't have
been there (since the new API isn't supposed to need them). These will
be removed once all drivers have been converted to the new API.

Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:11:05 -07:00
Maxim Georgiev
66f7223039 net: add NDOs for configuring hardware timestamping
Current hardware timestamping API for NICs requires implementing
.ndo_eth_ioctl() for SIOCGHWTSTAMP and SIOCSHWTSTAMP.

That API has some boilerplate such as request parameter translation
between user and kernel address spaces, handling possible translation
failures correctly, etc. Since it is the same all across the board, it
would be desirable to handle it through generic code.

Here we introduce .ndo_hwtstamp_get() and .ndo_hwtstamp_set(), which
implement that boilerplate and allow drivers to just act upon requests.
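
The split between generic boilerplate and driver callback can be sketched as follows. Everything prefixed mock_ is hypothetical, including the structure layouts and the error choices; in the kernel the translation step involves copying between user and kernel address spaces, which this userspace sketch replaces with plain field assignments.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical request layout as user space would pass it. */
struct mock_user_hwtstamp_config { int flags; int tx_type; int rx_filter; };
/* Hypothetical kernel-side form that drivers act upon. */
struct mock_kernel_hwtstamp_config { int flags; int tx_type; int rx_filter; };

struct mock_netdev;

struct mock_netdev_ops {
	int (*hwtstamp_set)(struct mock_netdev *dev,
			    struct mock_kernel_hwtstamp_config *cfg);
};

struct mock_netdev {
	const struct mock_netdev_ops *ops;
	struct mock_kernel_hwtstamp_config active; /* stands in for driver state */
};

/* Generic layer: translate the request, dispatch, translate the result back. */
static int mock_generic_hwtstamp_set(struct mock_netdev *dev,
				     struct mock_user_hwtstamp_config *ucfg)
{
	struct mock_kernel_hwtstamp_config kcfg;
	int err;

	if (!dev->ops || !dev->ops->hwtstamp_set)
		return -EOPNOTSUPP;

	kcfg.flags = ucfg->flags;
	kcfg.tx_type = ucfg->tx_type;
	kcfg.rx_filter = ucfg->rx_filter;

	err = dev->ops->hwtstamp_set(dev, &kcfg);
	if (err)
		return err;

	/* Report back what the driver actually configured. */
	ucfg->flags = kcfg.flags;
	ucfg->tx_type = kcfg.tx_type;
	ucfg->rx_filter = kcfg.rx_filter;
	return 0;
}

/* Example driver callback: validates and stores, no translation code needed. */
static int mock_drv_hwtstamp_set(struct mock_netdev *dev,
				 struct mock_kernel_hwtstamp_config *cfg)
{
	if (cfg->tx_type < 0 || cfg->tx_type > 1)
		return -ERANGE;
	dev->active = *cfg;
	return 0;
}

static const struct mock_netdev_ops mock_drv_ops = {
	.hwtstamp_set = mock_drv_hwtstamp_set,
};
```

The driver callback only ever sees the kernel-side structure, so the translation and its failure handling live in one place.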

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Maxim Georgiev <glipus@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20230801142824.1772134-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-02 19:11:05 -07:00
Jakub Kicinski
84e00d9bd4 net: convert some netlink netdev iterators to depend on the xarray
Reap the benefits of easier iteration thanks to the xarray.
Convert just the genetlink ones, those are easier to test.

Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230726185530.2247698-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28 11:35:58 -07:00
YueHaibing
d0358c1a37 net: Remove unused declaration dev_restart()
This is not used, so can remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230726143715.24700-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-27 17:17:28 -07:00
Maciej Fijalkowski
a097627dca net: add missing net_device::xdp_zc_max_segs description
Cited commit under 'Fixes' tag introduced new member to struct
net_device without providing description of it - fix it.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/all/20230720141613.61488b9e@canb.auug.org.au/
Fixes: 13ce2daa25 ("xsk: add new netlink attribute dedicated for ZC max frags")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Tested-by: Simon Horman <simon.horman@corigine.com> # build-tested
Link: https://lore.kernel.org/r/20230721145808.596298-1-maciej.fijalkowski@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-24 15:54:50 -07:00
Daniel Borkmann
e420bed025 bpf: Add fd-based tcx multi-prog infra with link support
This work refactors and adds a lightweight extension ("tcx") to the tc BPF
ingress and egress data path side for allowing BPF program management based
on fds via bpf() syscall through the newly added generic multi-prog API.
The main goal behind this work which we also presented at LPC [0] last year
and a recent update at LSF/MM/BPF this year [3] is to support long-awaited
BPF link functionality for tc BPF programs, which allows for a model of safe
ownership and program detachment.

Given the rise in tc BPF users in cloud native environments, this becomes
necessary to avoid hard to debug incidents either through stale leftover
programs or 3rd party applications accidentally stepping on each other's toes.
As a recap, a BPF link represents the attachment of a BPF program to a BPF
hook point. The BPF link holds a single reference to keep BPF program alive.
Moreover, hook points do not reference a BPF link, only the application's
fd or pinning does. A BPF link holds meta-data specific to attachment and
implements operations for link creation, (atomic) BPF program update,
detachment and introspection. The motivation for BPF links for tc BPF programs
is multi-fold, for example:

  - From Meta: "It's especially important for applications that are deployed
    fleet-wide and that don't "control" hosts they are deployed to. If such
    application crashes and no one notices and does anything about that, BPF
    program will keep running draining resources or even just, say, dropping
    packets. We at FB had outages due to such permanent BPF attachment
    semantics. With fd-based BPF link we are getting a framework, which allows
    safe, auto-detachable behavior by default, unless application explicitly
    opts in by pinning the BPF link." [1]

  - From Cilium-side the tc BPF programs we attach to host-facing veth devices
    and phys devices build the core datapath for Kubernetes Pods, and they
    implement forwarding, load-balancing, policy, EDT-management, etc, within
    BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently
    experienced hard-to-debug issues in a user's staging environment where
    another Kubernetes application using tc BPF attached to the same prio/handle
    of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath
    it. The goal is to establish a clear/safe ownership model via links which
    cannot accidentally be overridden. [0,2]

BPF links for tc can co-exist with non-link attachments, and the semantics are
in line also with XDP links: BPF links cannot replace other BPF links, BPF
links cannot replace non-BPF links, non-BPF links cannot replace BPF links and
lastly only non-BPF links can replace non-BPF links. In case of Cilium, this
would solve mentioned issue of safe ownership model as 3rd party applications
would not be able to accidentally wipe Cilium programs, even if they are not
BPF link aware.
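
The four replacement rules listed above reduce to a single predicate. A minimal sketch, with a hypothetical enum and function name that do not exist in the kernel:

```c
#include <stdbool.h>

/* How a program was attached: through a BPF link or through a non-link attachment. */
enum mock_attach_kind {
	MOCK_ATTACH_NON_LINK,
	MOCK_ATTACH_LINK,
};

/*
 * Replacement is allowed only when both the existing and the incoming
 * attachment are non-link attachments; any combination that involves a
 * BPF link is refused.
 */
static bool mock_may_replace(enum mock_attach_kind existing,
			     enum mock_attach_kind incoming)
{
	return existing == MOCK_ATTACH_NON_LINK &&
	       incoming == MOCK_ATTACH_NON_LINK;
}
```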

Earlier attempts [4] have tried to integrate BPF links into core tc machinery
to solve cls_bpf, which has been intrusive to the generic tc kernel API with
extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could
be wiped from the qdisc also. Locking a tc BPF program in place this way, is
getting into layering hacks given the two object models are vastly different.

We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF
attach API, so that the BPF link implementation blends in naturally similar to
other link types which are fd-based and without the need for changing core tc
internal APIs. BPF programs for tc can then be successively migrated from classic
cls_bpf to the new tc BPF link without needing to change the program's source
code, just the BPF loader mechanics for attaching is sufficient.

For the current tc framework, there is no change in behavior with this change
and neither does this change touch on tc core kernel APIs. The gist of this
patch is that the ingress and egress hook have a lightweight, qdisc-less
extension for BPF to attach its tc BPF programs, in other words, a minimal
entry point for tc BPF. The name tcx has been suggested from discussion of
earlier revisions of this work as a good fit, and to more easily differ between
the classic cls_bpf attachment and the fd-based one.

For the ingress and egress tcx points, the device holds a cache-friendly array
with program pointers which is separated from control plane (slow-path) data.
Earlier versions of this work used priority to determine ordering and expression
of dependencies similar as with classic tc, but it was challenged that for
something more future-proof a better user experience is required. Hence this
resulted in the design and development of the generic attach/detach/query API
for multi-progs. See prior patch with its discussion on the API design. tcx is
the first user and later we plan to integrate also others, for example, one
candidate is multi-prog support for XDP which would benefit and have the same
'look and feel' from API perspective.

The goal with tcx is to have maximum compatibility to existing tc BPF programs,
so they don't need to be rewritten specifically. Compatibility to call into
classic tcf_classify() is also provided in order to allow successive migration
or both to cleanly co-exist where needed given it's all one logical tc layer and
the tcx plus classic tc cls/act build one logical overall processing pipeline.

tcx supports the simplified return codes TCX_NEXT which is non-terminating (go
to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT.
The fd-based API is behind a static key, so that when unused the code is also
not entered. The struct tcx_entry's program array is currently static, but
could be made dynamic if necessary at a point in future. The a/b pair swap
design has been chosen so that for detachment there are no allocations which
otherwise could fail.
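
The return code semantics can be illustrated with a small userspace loop over a static program array. This is a sketch under assumptions: the mock_ names are hypothetical, the numeric values of the return codes are chosen for illustration, and the programs are plain C functions rather than BPF programs.

```c
#include <stddef.h>

/* Return codes modelled on the description above; values are illustrative. */
enum mock_tcx_action {
	MOCK_TCX_NEXT = -1,    /* non-terminating: continue with the next program */
	MOCK_TCX_PASS = 0,     /* terminating */
	MOCK_TCX_DROP = 2,     /* terminating */
	MOCK_TCX_REDIRECT = 7, /* terminating */
};

typedef int (*mock_prog_fn)(void *pkt);

/* Three trivial example programs. */
static int mock_prog_next(void *pkt) { (void)pkt; return MOCK_TCX_NEXT; }
static int mock_prog_drop(void *pkt) { (void)pkt; return MOCK_TCX_DROP; }
static int mock_prog_pass(void *pkt) { (void)pkt; return MOCK_TCX_PASS; }

/*
 * Run the programs in array order and stop at the first terminating
 * verdict. If every program returns MOCK_TCX_NEXT (or the array is
 * empty), MOCK_TCX_NEXT is returned so the caller can continue with
 * whatever processing comes after this hook.
 */
static int mock_tcx_run(const mock_prog_fn *progs, size_t count, void *pkt)
{
	int ret = MOCK_TCX_NEXT;

	for (size_t i = 0; i < count; i++) {
		ret = progs[i](pkt);
		if (ret != MOCK_TCX_NEXT)
			break;
	}
	return ret;
}
```

For the chain { mock_prog_next, mock_prog_drop, mock_prog_pass } the loop stops at the second program, so the third one never runs.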

The work has been tested with tc-testing selftest suite which all passes, as
well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB.

Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews
of this work.

  [0] https://lpc.events/event/16/contributions/1353/
  [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com
  [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog
  [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf
  [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 10:07:27 -07:00
Maciej Fijalkowski
13ce2daa25 xsk: add new netlink attribute dedicated for ZC max frags
Introduce a new netlink attribute, NETDEV_A_DEV_XDP_ZC_MAX_SEGS, that
carries the maximum number of fragments that the underlying ZC driver is
able to handle on the TX side. It is included in the netlink response only
when the driver supports ZC. Any value higher than 1 implies multi-buffer
ZC support on the underlying device.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20230719132421.584801-11-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19 09:56:49 -07:00
Jakub Kicinski
a685d0df75 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZJX+ygAKCRDbK58LschI
 g0/2AQDHg12smf9mPfK9wOFDNRIIX8r2iufB8LUFQMzCwltN6gEAkAdkAyfbof7P
 TMaNUiHABijAFtChxoSI35j3OOSRrwE=
 =GJgN
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-06-23

We've added 49 non-merge commits during the last 24 day(s) which contain
a total of 70 files changed, 1935 insertions(+), 442 deletions(-).

The main changes are:

1) Extend bpf_fib_lookup helper to allow passing the route table ID,
   from Louis DeLosSantos.

2) Fix regsafe() in verifier to call check_ids() for scalar registers,
   from Eduard Zingerman.

3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and()
   and a rework of bpf_cpumask_any*() kfuncs. Additionally,
   add selftests, from David Vernet.

4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings,
   from Gilad Sever.

5) Change bpf_link_put() to use workqueue unconditionally to fix it
   under PREEMPT_RT, from Sebastian Andrzej Siewior.

6) Follow-ups to address issues in the bpf_refcount shared ownership
   implementation, from Dave Marchevsky.

7) A few general refactorings to BPF map and program creation permissions
   checks which were part of the BPF token series, from Andrii Nakryiko.

8) Various fixes for benchmark framework and add a new benchmark
   for BPF memory allocator to BPF selftests, from Hou Tao.

9) Documentation improvements around iterators and trusted pointers,
   from Anton Protopopov.

10) Small cleanup in verifier to improve allocated object check,
    from Daniel T. Lee.

11) Improve performance of bpf_xdp_pointer() by avoiding access
    to shared_info when XDP packet does not have frags,
    from Jesper Dangaard Brouer.

12) Silence a harmless syzbot-reported warning in btf_type_id_size(),
    from Yonghong Song.

13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper,
    from Jarkko Sakkinen.

14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS,
    from Viktor Malik.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
  bpf, docs: Document existing macros instead of deprecated
  bpf, docs: BPF Iterator Document
  selftests/bpf: Fix compilation failure for prog vrf_socket_lookup
  selftests/bpf: Add vrf_socket_lookup tests
  bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings
  bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint
  bpf: Factor out socket lookup functions for the TC hookpoint.
  selftests/bpf: Set the default value of consumer_cnt as 0
  selftests/bpf: Ensure that next_cpu() returns a valid CPU number
  selftests/bpf: Output the correct error code for pthread APIs
  selftests/bpf: Use producer_cnt to allocate local counter array
  xsk: Remove unused inline function xsk_buff_discard()
  bpf: Keep BPF_PROG_LOAD permission checks clear of validations
  bpf: Centralize permissions checks for all BPF map types
  bpf: Inline map creation logic in map_create() function
  bpf: Move unprivileged checks into map_create() and bpf_prog_load()
  bpf: Remove in_atomic() from bpf_link_put().
  selftests/bpf: Verify that check_ids() is used for scalars in regsafe()
  bpf: Verify scalar ids mapping in regsafe() using check_ids()
  selftests/bpf: Check if mark_chain_precision() follows scalar ids
  ...
====================

Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24 14:52:28 -07:00
Gilad Sever
9a5cb79762 bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings
When calling bpf_sk_lookup_tcp(), bpf_sk_lookup_udp() or
bpf_skc_lookup_tcp() from tc/xdp ingress, VRF socket bindings aren't
respected, i.e. unbound sockets are returned, and bound sockets aren't
found.

VRF binding is determined by the sdif argument to sk_lookup(), however
when called from tc the IP SKB control block isn't initialized and thus
inet{,6}_sdif() always returns 0.

Fix by calculating sdif for the tc/xdp flows by observing the device's
l3 enslaved state.

The cg/sk_skb hooking points which are expected to support
inet{,6}_sdif() pass sdif=-1 which makes __bpf_skc_lookup() use the
existing logic.
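
A minimal sketch of the two pieces described above, assuming a simplified device model. The mock_ structure and function names are hypothetical; the real code works on struct net_device and on the skb control block.

```c
#include <stdbool.h>

/* Hypothetical device model: interface index plus its L3 enslaved state. */
struct mock_l3_dev {
	int ifindex;
	bool is_l3_slave; /* true when the device is enslaved to an L3 master (VRF) */
};

/* tc/xdp flows: derive sdif from the device instead of the control block. */
static int mock_dev_sdif(const struct mock_l3_dev *dev)
{
	return dev->is_l3_slave ? dev->ifindex : 0;
}

/*
 * Lookup side: a negative sdif from the caller means "the control block
 * is valid here, keep using the value read from it".
 */
static int mock_effective_sdif(int caller_sdif, int cb_sdif)
{
	return caller_sdif >= 0 ? caller_sdif : cb_sdif;
}
```

In this model a tc or xdp caller passes mock_dev_sdif() of the ingress device, while the cg/sk_skb hooks pass -1 and keep the existing behaviour.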

Fixes: 6acc9b432e ("bpf: Add helper to retrieve socket in BPF")
Signed-off-by: Gilad Sever <gilad9366@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/20230621104211.301902-4-gilad9366@gmail.com
2023-06-21 23:48:41 +02:00
Jakub Kicinski
70f7457ad6 net: create device lookup API with reference tracking
New users of dev_get_by_index() and dev_get_by_name() keep
getting added and it would be nice to steer them towards
the APIs with reference tracking.

Add variants of those calls which allocate the reference
tracker and use them in a couple of places.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-15 08:21:11 +01:00
Eric Dumazet
d457a0e329 net: move gso declarations and functions to their own files
Move declarations into include/net/gso.h and code into net/core/gso.c

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-10 00:11:41 -07:00
Eric Dumazet
d636fc5dd6 net: sched: add rcu annotations around qdisc->qdisc_sleeping
syzbot reported a race around qdisc->qdisc_sleeping [1]

It is time we add proper annotations to reads and writes to/from
qdisc->qdisc_sleeping.

[1]
BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu

read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1:
qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331
__tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174
tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547
rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmsg+0x1e3/0x270 net/socket.c:2586
__do_sys_sendmsg net/socket.c:2595 [inline]
__se_sys_sendmsg net/socket.c:2593 [inline]
__x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0:
dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115
qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103
tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693
rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395
netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365
netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x375/0x4c0 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmsg+0x1e3/0x270 net/socket.c:2586
__do_sys_sendmsg net/socket.c:2595 [inline]
__se_sys_sendmsg net/socket.c:2593 [inline]
__x64_sys_sendmsg+0x46/0x50 net/socket.c:2593
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023

Fixes: 3a7d0d07a3 ("net: sched: extend Qdisc with rcu")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Vlad Buslov <vladbu@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-07 10:25:39 +01:00
Eric Dumazet
5c3b74a92a rfs: annotate lockless accesses to RFS sock flow table
Add READ_ONCE()/WRITE_ONCE() on accesses to the sock flow table.

This also prevents a (smart ?) compiler from removing the condition in:

if (table->ents[index] != newval)
        table->ents[index] = newval;

We need the condition to avoid dirtying a shared cache line.

Fixes: fec5e652e5 ("rfs: Receive Flow Steering")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-07 10:08:45 +01:00
Eric Dumazet
87eff2ec57 net: optimize napi_threaded_poll() vs RPS/RFS
We use napi_threaded_poll() in order to reduce our softirq dependency.

We can add a follow-up to 821eba962d ("net: optimize napi_schedule_rps()")
to further remove the need to fire NET_RX_SOFTIRQ whenever
RPS/RFS are used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-23 13:35:07 +01:00
Johannes Berg
5b8285cca6 net: move dropreason.h to dropreason-core.h
This will, after the next patch, hold only the core
drop reasons and minimal infrastructure. Fix a small
kernel-doc issue while at it, to avoid the move
triggering a checker.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-20 20:20:49 -07:00
Jakub Kicinski
8c48eea3ad page_pool: allow caching from safely localized NAPI
Recent patches to mlx5 mentioned a regression when moving from
driver local page pool to only using the generic page pool code.
Page pool has two recycling paths (1) direct one, which runs in
safe NAPI context (basically consumer context, so producing
can be lockless); and (2) via a ptr_ring, which takes a spin
lock because the freeing can happen from any CPU; producer
and consumer may run concurrently.

Since the page pool code was added, Eric introduced a revised version
of deferred skb freeing. TCP skbs are now usually returned to the CPU
which allocated them, and freed in softirq context. This places the
freeing (producing of pages back to the pool) enticingly close to
the allocation (consumer).

If we can prove that we're freeing in the same softirq context in which
the consumer NAPI will run - lockless use of the cache is perfectly fine,
no need for the lock.

Let drivers link the page pool to a NAPI instance. If the NAPI instance
is scheduled on the same CPU on which we're freeing - place the pages
in the direct cache.

With that and patched bnxt (XDP enabled to engage the page pool, sigh,
bnxt really needs page pool work :() I see a 2.6% perf boost with
a TCP stream test (app on a different physical core than softirq).

The CPU use of relevant functions decreases as expected:

  page_pool_refill_alloc_cache   1.17% -> 0%
  _raw_spin_lock                 2.41% -> 0.98%

Only consider the lockless path to be safe when NAPI is scheduled
- in practice this should cover the majority, if not all, of steady state
workloads. It's usually the NAPI kicking in that causes the skb flush.

The main case we'll miss out on is when the application runs on the same
CPU as NAPI. In that case we don't use the deferred skb free path.
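
The recycling decision described above can be sketched roughly as follows.
This is a user-space model under stated assumptions, not the page pool
code: struct napi_sketch, struct pool_sketch, pool_can_recycle_direct()
and pool_recycle() are hypothetical names, and the locked ptr_ring path is
reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DIRECT_CACHE_SIZE 4 /* hypothetical, kept small for illustration */

struct napi_sketch {
	bool scheduled;  /* is this NAPI instance currently scheduled? */
	int  owner_cpu;  /* CPU it is scheduled on, valid only if scheduled */
};

struct pool_sketch {
	const struct napi_sketch *napi;   /* optional link set by the driver */
	void  *direct[DIRECT_CACHE_SIZE]; /* lockless, consumer-side cache */
	size_t direct_count;
	size_t ring_count;                /* stand-in for the locked ring path */
};

/* The direct cache is only safe when we free from softirq context on the
 * very CPU where the linked NAPI instance is scheduled. */
static bool pool_can_recycle_direct(const struct pool_sketch *pool,
				    int this_cpu, bool in_softirq)
{
	const struct napi_sketch *napi = pool->napi;

	return in_softirq && napi != NULL && napi->scheduled &&
	       napi->owner_cpu == this_cpu;
}

/* Returns true if the page went to the direct cache, false if it took
 * the (modelled) locked ring path instead. */
static bool pool_recycle(struct pool_sketch *pool, void *page,
			 int this_cpu, bool in_softirq)
{
	assert(pool != NULL && page != NULL);

	if (pool_can_recycle_direct(pool, this_cpu, in_softirq) &&
	    pool->direct_count < DIRECT_CACHE_SIZE) {
		pool->direct[pool->direct_count++] = page;
		return true;
	}
	pool->ring_count++;
	return false;
}
```

In this model a free from a different CPU, from outside softirq context,
or while the NAPI instance is not scheduled always falls back to the ring.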

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-14 18:56:12 -07:00
Jakub Kicinski
800e68c44f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

tools/testing/selftests/net/config
  62199e3f16 ("selftests: net: Add VXLAN MDB test")
  3a0385be13 ("selftests: add the missing CONFIG_IP_SCTP in net config")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-13 16:04:28 -07:00
Jesper Dangaard Brouer
0cd917a4a8 xdp: rss hash types representation
The RSS hash type specifies what portion of packet data NIC hardware used
when calculating RSS hash value. The RSS types are focused on Internet
traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often gets hash
value zero and no RSS type. L3 is focused on IPv4 vs. IPv6, and L4
primarily on TCP vs. UDP, but some hardware supports SCTP.

Hardware RSS types are encoded differently for each hardware NIC. Most
hardware represents the RSS hash type as a number. Determining L3 vs. L4
often requires a mapping table, as there often isn't a pattern or sorting
according to OSI layer.

The patch introduces an XDP RSS hash type (enum xdp_rss_hash_type) that
contains both BITs for the L3/L4 types, and combinations to be used by
drivers for their mapping tables. The enum xdp_rss_type_bits gets exposed
to BPF via BTF, and it is up to the BPF programmer to match using these
defines.

This proposal changes the kfunc API bpf_xdp_metadata_rx_hash(), adding
a pointer value argument to provide the RSS hash type.
Change the signature for all xmo_rx_hash calls in drivers to make it compile.

The RSS type implementations for each driver come as separate patches.
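
To make the "bits plus combinations plus per-driver mapping table" idea
more concrete, here is a small sketch. All SKETCH_* names, the bit values
and the hardware codes in hw_to_rss_type[] are invented for this example;
they are not the values defined by the patch or by any real NIC.

```c
#include <assert.h>

/* Individual bits: one per L3 protocol, one generic "L4 was hashed" bit,
 * and one per L4 protocol. */
enum sketch_rss_bits {
	SKETCH_RSS_L3_IPV4 = 1u << 0,
	SKETCH_RSS_L3_IPV6 = 1u << 1,
	SKETCH_RSS_L4      = 1u << 2,
	SKETCH_RSS_L4_TCP  = 1u << 3,
	SKETCH_RSS_L4_UDP  = 1u << 4,
};

/* Combinations that a driver mapping table can refer to directly. */
enum sketch_rss_type {
	SKETCH_RSS_TYPE_NONE        = 0,
	SKETCH_RSS_TYPE_L3_IPV4     = SKETCH_RSS_L3_IPV4,
	SKETCH_RSS_TYPE_L3_IPV6     = SKETCH_RSS_L3_IPV6,
	SKETCH_RSS_TYPE_L4_IPV4_TCP = SKETCH_RSS_L3_IPV4 | SKETCH_RSS_L4 | SKETCH_RSS_L4_TCP,
	SKETCH_RSS_TYPE_L4_IPV4_UDP = SKETCH_RSS_L3_IPV4 | SKETCH_RSS_L4 | SKETCH_RSS_L4_UDP,
	SKETCH_RSS_TYPE_L4_IPV6_TCP = SKETCH_RSS_L3_IPV6 | SKETCH_RSS_L4 | SKETCH_RSS_L4_TCP,
};

#define HW_RSS_CODES 6u /* hypothetical number of hardware type codes */

/* Hypothetical hardware numbering with no ordering by layer, which is
 * why a lookup table is needed at all. */
static const enum sketch_rss_type hw_to_rss_type[HW_RSS_CODES] = {
	[0] = SKETCH_RSS_TYPE_NONE,
	[1] = SKETCH_RSS_TYPE_L3_IPV4,
	[2] = SKETCH_RSS_TYPE_L4_IPV4_TCP,
	[3] = SKETCH_RSS_TYPE_L4_IPV4_UDP,
	[4] = SKETCH_RSS_TYPE_L3_IPV6,
	[5] = SKETCH_RSS_TYPE_L4_IPV6_TCP,
};

/* Translate a hardware code; unknown codes map to "no RSS type". */
static enum sketch_rss_type rss_type_from_hw(unsigned int hw_code)
{
	assert(HW_RSS_CODES == sizeof(hw_to_rss_type) / sizeof(hw_to_rss_type[0]));
	return hw_code < HW_RSS_CODES ? hw_to_rss_type[hw_code]
				      : SKETCH_RSS_TYPE_NONE;
}

/* A consumer only has to test the generic L4 bit. */
static int rss_type_is_l4(unsigned int rss_type)
{
	return (rss_type & (unsigned int)SKETCH_RSS_L4) != 0;
}
```

With a layout like this, a consumer can ask "was this an L4 hash?" with a
single bit test instead of enumerating every protocol combination.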

Fixes: 3d76a4d3d4 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13 11:15:10 -07:00
Jakub Kicinski
301f227fc8 net: piggy back on the memory barrier in bql when waking queues
Drivers call netdev_tx_completed_queue() right before
netif_txq_maybe_wake(). If BQL is enabled netdev_tx_completed_queue()
should issue a memory barrier, so we can depend on that separating
the stop check from the consumer index update, instead of adding
another barrier in netif_txq_maybe_wake().

This matters more than the barriers on the xmit path, because
the wake condition is almost always true. So we issue the
consumer side barrier often.

Wrap netdev_tx_completed_queue() in a local helper to issue
the barrier even if BQL is disabled. Keep the same semantics
as netdev_tx_completed_queue() (barrier only if bytes != 0)
to make it clear that the barrier is conditional.

Plus, since the macro now gets pkt/byte counts as arguments,
we can skip waking if there were no packets completed.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10 17:56:18 -07:00
Jakub Kicinski
c91c46de6b net: provide macros for commonly copied lockless queue stop/wake code
A lot of drivers follow the same scheme to stop / start queues
without introducing locks between xmit and NAPI tx completions.
I'm guessing they all copy'n'paste each other's code.
The original code dates back all the way to e1000 and Linux 2.6.19.

Smaller drivers shy away from the scheme and introduce a lock
which may cause deadlocks in netpoll.

Provide macros which encapsulate the necessary logic.

The macros do not prevent false wake-ups; the extra barrier
required to close that race is not worth it. See discussion in:
https://lore.kernel.org/all/c39312a2-4537-14b4-270c-9fe1fbb91e89@gmail.com/

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-10 17:56:18 -07:00
Vladimir Oltean
5a17818682 net: dsa: replace NETDEV_PRE_CHANGE_HWTSTAMP notifier with a stub
There was a sort of rush surrounding commit 88c0a6b503 ("net: create a
netdev notifier for DSA to reject PTP on DSA master"), due to a desire
to convert DSA's attempt to deny TX timestamping on a DSA master to
something that doesn't block the kernel-wide API conversion from
ndo_eth_ioctl() to ndo_hwtstamp_set().

What was required was a mechanism that did not depend on ndo_eth_ioctl(),
and what was provided was a mechanism that did not depend on
ndo_eth_ioctl(), while at the same time introducing something that
wasn't absolutely necessary - a new netdev notifier.

There have been objections from Jakub Kicinski that using notifiers in
general when they are not absolutely necessary creates complications to
the control flow and difficulties to maintainers who look at the code.
So there is a desire to not use notifiers.

In addition to that, the notifier chain gets called even if there is no
DSA in the system and no one is interested in applying any restriction.

Take the model of udp_tunnel_nic_ops and introduce a stub mechanism,
through which net/core/dev_ioctl.c can call into DSA even when
CONFIG_NET_DSA=m.

Compared to the code that existed prior to the notifier conversion, aka
what was added in commits:
- 4cfab35667 ("net: dsa: Add wrappers for overloaded ndo_ops")
- 3369afba1e ("net: Call into DSA netdevice_ops wrappers")

this is different because we are not overloading any struct
net_device_ops of the DSA master anymore, but rather, we are exposing a
rather specific functionality which is orthogonal to which API is used
to enable it - ndo_eth_ioctl() or ndo_hwtstamp_set().

Also, what is similar is that both approaches use function pointers to
get from built-in code to DSA.

There is no point in replicating the function pointers towards
__dsa_master_hwtstamp_validate() once for every CPU port (dev->dsa_ptr).
Instead, it is sufficient to introduce a singleton struct dsa_stubs,
built into the kernel, which contains a single function pointer to
__dsa_master_hwtstamp_validate().

I find this approach preferable to what we had originally, because
dev->dsa_ptr->netdev_ops->ndo_do_ioctl() used to require going through
struct dsa_port (dev->dsa_ptr), and so, this was incompatible with any
attempts to add any data encapsulation and hide DSA data structures from
the outside world.
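
The stub mechanism described above can be pictured with the following
user-space sketch. It only models the shape of the idea (a built-in
singleton holding one function pointer that a module fills in); every
name in it is hypothetical and the locking a real implementation needs
around the pointer is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct netdev_sketch {
	int is_dsa_master;
};

/* One table of stubs for the whole system, not one per port. */
struct hwtstamp_stubs_sketch {
	int (*master_hwtstamp_validate)(const struct netdev_sketch *dev);
};

/* "Built-in" side: stays NULL while the module is not loaded. */
static const struct hwtstamp_stubs_sketch *stubs_sketch;

/* Called from core code; a missing module means "no restriction". */
static int core_hwtstamp_validate(const struct netdev_sketch *dev)
{
	assert(dev != NULL);

	if (stubs_sketch == NULL ||
	    stubs_sketch->master_hwtstamp_validate == NULL)
		return 0;
	return stubs_sketch->master_hwtstamp_validate(dev);
}

/* "Module" side: the actual policy. */
static int module_validate(const struct netdev_sketch *dev)
{
	return dev->is_dsa_master ? -EOPNOTSUPP : 0;
}

static const struct hwtstamp_stubs_sketch module_stubs = {
	.master_hwtstamp_validate = module_validate,
};

static void module_load(void)
{
	stubs_sketch = &module_stubs;
}

static void module_unload(void)
{
	stubs_sketch = NULL;
}
```

Core code never touches module data structures here; it only follows one
pointer, which is the encapsulation benefit argued for above.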

Link: https://lore.kernel.org/netdev/20230403083019.120b72fd@kernel.org/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-09 15:35:49 +01:00
Vladimir Oltean
88c0a6b503 net: create a netdev notifier for DSA to reject PTP on DSA master
The fact that PTP 2-step TX timestamping is broken on DSA switches if
the master also timestamps the same packets is documented by commit
f685e609a3 ("net: dsa: Deny PTP on master if switch supports it").
We attempt to help the users avoid shooting themselves in the foot by
making DSA reject the timestamping ioctls on an interface that is a DSA
master, and the switch tree beneath it contains switches which are aware
of PTP.

The only problem is that there isn't an established way of intercepting
ndo_eth_ioctl calls, so DSA creates avoidable burden upon the network
stack by creating a struct dsa_netdevice_ops with overlaid function
pointers that are manually checked from the relevant call sites. There
used to be 2 such dsa_netdevice_ops, but now, ndo_eth_ioctl is the only
one left.

There is an ongoing effort to migrate driver-visible hardware timestamping
control from the ndo_eth_ioctl() based API to a new ndo_hwtstamp_set()
model, but DSA actively prevents that migration, since dsa_master_ioctl()
is currently coded to manually call the master's legacy ndo_eth_ioctl(),
and so, whenever a network device driver would be converted to the new
API, DSA's restrictions would be circumvented, because any device could
be used as a DSA master.

The established way for unrelated modules to react on a net device event
is via netdevice notifiers. So we create a new notifier which gets
called whenever there is an attempt to change hardware timestamping
settings on a device.

Finally, there is another reason why a netdev notifier will be a good
idea, besides strictly DSA, and this has to do with PHY timestamping.

With ndo_eth_ioctl(), all MAC drivers must manually call
phy_has_hwtstamp() before deciding whether to act upon SIOCSHWTSTAMP,
otherwise they must pass this ioctl to the PHY driver via
phy_mii_ioctl().

With the new ndo_hwtstamp_set() API, it will be desirable to simply not
make any calls into the MAC device driver when timestamping should be
performed at the PHY level.

But there exist drivers, such as the lan966x switch, which need to
install packet traps for PTP regardless of whether they are the layer
that provides the hardware timestamps, or the PHY is. That would be
impossible to support with the new API.

The proposal there, too, is to introduce a netdev notifier which acts as
a better cue for switching drivers to add or remove PTP packet traps,
than ndo_hwtstamp_set(). The one introduced here "almost" works there as
well, except for the fact that packet traps should only be installed if
the PHY driver succeeded in enabling hardware timestamping, whereas here,
we need to deny hardware timestamping on the DSA master before it
actually gets enabled. This is why this notifier is called "PRE_", and
the notifier that would get used for PHY timestamping and packet traps
would be called NETDEV_CHANGE_HWTSTAMP. This isn't a new concept, for
example NETDEV_CHANGEUPPER and NETDEV_PRECHANGEUPPER do the same thing.

In expectation of future netlink UAPI, we also pass a non-NULL extack
pointer to the netdev notifier, and we make DSA populate it with an
informative reason for the rejection. To avoid making it go to waste, we
make the ioctl-based dev_set_hwtstamp() create a fake extack and print
the message to the kernel log.
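
The "PRE_" idea (ask listeners before the change and let any of them veto
it with a reason) can be sketched in user space as below. This is a toy
model, not the netdev notifier API: the event names, return codes,
notifier_register(), set_hwtstamp() and the reason string are all
invented for illustration, and the post-change event is included only to
show the contrast with the pre-change one.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_event { SKETCH_PRE_CHANGE, SKETCH_CHANGE };

#define SKETCH_OK     0
#define SKETCH_VETO   1
#define MAX_NOTIFIERS 4

struct dev_sketch {
	int is_dsa_master;
	int hwtstamp_enabled;
	int change_events_seen;
};

typedef int (*notifier_fn)(enum sketch_event ev, struct dev_sketch *dev,
			   const char **reason);

static notifier_fn chain[MAX_NOTIFIERS];
static size_t chain_len;

static int notifier_register(notifier_fn fn)
{
	if (fn == NULL || chain_len == MAX_NOTIFIERS)
		return -1;
	chain[chain_len++] = fn;
	return 0;
}

/* Walk the chain; the first veto stops the walk. */
static int notifiers_call(enum sketch_event ev, struct dev_sketch *dev,
			  const char **reason)
{
	for (size_t i = 0; i < chain_len; i++)
		if (chain[i](ev, dev, reason) == SKETCH_VETO)
			return SKETCH_VETO;
	return SKETCH_OK;
}

/* A listener that rejects the change on a master device and reports why
 * through the reason pointer (the role extack plays in the text). */
static int dsa_like_notifier(enum sketch_event ev, struct dev_sketch *dev,
			     const char **reason)
{
	if (ev == SKETCH_PRE_CHANGE && dev->is_dsa_master) {
		*reason = "timestamping denied on master (example message)";
		return SKETCH_VETO;
	}
	if (ev == SKETCH_CHANGE)
		dev->change_events_seen++;
	return SKETCH_OK;
}

/* Returns 0 on success, -1 if a listener vetoed before anything changed. */
static int set_hwtstamp(struct dev_sketch *dev, const char **reason)
{
	assert(dev != NULL && reason != NULL);

	if (notifiers_call(SKETCH_PRE_CHANGE, dev, reason) == SKETCH_VETO)
		return -1;
	dev->hwtstamp_enabled = 1;
	notifiers_call(SKETCH_CHANGE, dev, reason);
	return 0;
}
```

Because the veto happens before the setting is applied, a rejected
request leaves the device untouched, which a post-change event alone
could not guarantee.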

Link: https://lore.kernel.org/netdev/20230401191215.tvveoi3lkawgg6g4@skbuf/
Link: https://lore.kernel.org/netdev/20230310164451.ls7bbs6pdzs4m6pw@skbuf/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-03 10:04:27 +01:00
Jakub Kicinski
dd2d660440 net: minor reshuffle of napi_struct
napi_id is read by GRO and drivers to mark skbs, and it currently
sits at the end of the structure, in a mostly unused cache line.
Move it up into a hole, and separate the clearly control path
fields from the important ones.

Before:

struct napi_struct {
	struct list_head           poll_list;            /*     0    16 */
	long unsigned int          state;                /*    16     8 */
	int                        weight;               /*    24     4 */
	int                        defer_hard_irqs_count; /*    28     4 */
	long unsigned int          gro_bitmask;          /*    32     8 */
	int                        (*poll)(struct napi_struct *, int); /*    40     8 */
	int                        poll_owner;           /*    48     4 */

	/* XXX 4 bytes hole, try to pack */

	struct net_device *        dev;                  /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct gro_list            gro_hash[8];          /*    64   192 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	struct sk_buff *           skb;                  /*   256     8 */
	struct list_head           rx_list;              /*   264    16 */
	int                        rx_count;             /*   280     4 */

	/* XXX 4 bytes hole, try to pack */

	struct hrtimer             timer;                /*   288    64 */

	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */
	struct list_head           dev_list;             /*   352    16 */
	struct hlist_node          napi_hash_node;       /*   368    16 */
	/* --- cacheline 6 boundary (384 bytes) --- */
	unsigned int               napi_id;              /*   384     4 */

	/* XXX 4 bytes hole, try to pack */

	struct task_struct *       thread;               /*   392     8 */

	/* size: 400, cachelines: 7, members: 17 */
	/* sum members: 388, holes: 3, sum holes: 12 */
	/* paddings: 1, sum paddings: 4 */
	/* last cacheline: 16 bytes */
};

After:

struct napi_struct {
	struct list_head           poll_list;            /*     0    16 */
	long unsigned int          state;                /*    16     8 */
	int                        weight;               /*    24     4 */
	int                        defer_hard_irqs_count; /*    28     4 */
	long unsigned int          gro_bitmask;          /*    32     8 */
	int                        (*poll)(struct napi_struct *, int); /*    40     8 */
	int                        poll_owner;           /*    48     4 */

	/* XXX 4 bytes hole, try to pack */

	struct net_device *        dev;                  /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct gro_list            gro_hash[8];          /*    64   192 */
	/* --- cacheline 4 boundary (256 bytes) --- */
	struct sk_buff *           skb;                  /*   256     8 */
	struct list_head           rx_list;              /*   264    16 */
	int                        rx_count;             /*   280     4 */
	unsigned int               napi_id;              /*   284     4 */
	struct hrtimer             timer;                /*   288    64 */

	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 5 boundary (320 bytes) was 32 bytes ago --- */
	struct task_struct *       thread;               /*   352     8 */
	struct list_head           dev_list;             /*   360    16 */
	struct hlist_node          napi_hash_node;       /*   376    16 */

	/* size: 392, cachelines: 7, members: 17 */
	/* sum members: 388, holes: 1, sum holes: 4 */
	/* paddings: 1, sum paddings: 4 */
	/* forced alignments: 1 */
	/* last cacheline: 8 bytes */
} __attribute__((__aligned__(8)));
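
The effect of moving a 4-byte field into a hole can be reproduced on a
much smaller structure. The two layouts below are made up for
illustration (they are not napi_struct), and the exact size of the
"before" layout depends on the ABI; on common 64-bit ABIs it carries a
4-byte hole plus tail padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct layout_before {
	uint64_t state;
	uint32_t rx_count;
	/* hole here on ABIs that align uint64_t to 8 bytes */
	uint64_t timer;
	uint32_t id;
	/* possible tail padding */
};

struct layout_after {
	uint64_t state;
	uint32_t rx_count;
	uint32_t id;       /* moved up next to rx_count, filling the hole */
	uint64_t timer;
};

/* After the move, id sits directly behind rx_count. */
static_assert(offsetof(struct layout_after, id) ==
	      offsetof(struct layout_after, rx_count) + sizeof(uint32_t),
	      "id should directly follow rx_count");

/* Bytes saved by the reshuffle on the current ABI (may be 0). */
static size_t bytes_saved(void)
{
	return sizeof(struct layout_before) - sizeof(struct layout_after);
}
```

Tools such as pahole print the same information for real kernel
structures, which is what the before/after dumps above show.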

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-02 13:27:17 +01:00
Eric Dumazet
c59647c0dc net: add softnet_data.in_net_rx_action
We want to make two optimizations in napi_schedule_rps() and
____napi_schedule() which require knowing whether these helpers are
called from net_rx_action() or from
other contexts.

sd.in_net_rx_action is only read/written by the owning cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-03-30 13:40:00 +02:00
Jakub Kicinski
3eb8eea2a4 docs: networking: document NAPI
Add basic documentation about NAPI. We can stop linking to the ancient
doc on the LF wiki.

Link: https://lore.kernel.org/all/20230315223044.471002-1-kuba@kernel.org/
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Pavel Pisa <pisa@cmp.felk.cvut.cz> # for ctucanfd-driver.rst
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20230322053848.198452-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-23 19:47:40 -07:00
Nick Child
1cc6571f56 netdev: Enforce index cap in netdev_get_tx_queue
When requesting a TX queue at a given index, warn on out-of-bounds
referencing if the index is greater than the allocated number of
queues.

Specifically, since this function is used heavily in the networking
stack, use DEBUG_NET_WARN_ON_ONCE to avoid executing a new branch on
every packet.
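
A minimal sketch of a debug-only, warn-once bounds check is shown below.
SKETCH_DEBUG_NET and SKETCH_WARN_ON_ONCE() are simplified stand-ins
invented for this example (they are not the kernel macros), as are the
struct and function names.

```c
#include <assert.h>
#include <stdio.h>

#ifndef SKETCH_DEBUG_NET
#define SKETCH_DEBUG_NET 1 /* hypothetical switch, on by default here */
#endif

static unsigned int sketch_warn_count; /* lets callers observe warnings */

#if SKETCH_DEBUG_NET
/* Warn at most once per call site. */
#define SKETCH_WARN_ON_ONCE(cond)					\
	do {								\
		static int warned;					\
		if ((cond) && !warned) {				\
			warned = 1;					\
			sketch_warn_count++;				\
			fprintf(stderr, "warning: %s\n", #cond);	\
		}							\
	} while (0)
#else
/* Compiled out: no branch is added to the hot path. */
#define SKETCH_WARN_ON_ONCE(cond) ((void)0)
#endif

struct txq_sketch {
	int id;
};

struct dev_sketch {
	unsigned int num_tx_queues;
	struct txq_sketch *tx;
};

/* Returns the queue at index; only warns (in debug builds) when the
 * index is out of range, it does not clamp or reject it. */
static struct txq_sketch *get_tx_queue(const struct dev_sketch *dev,
				       unsigned int index)
{
	assert(dev != NULL && dev->tx != NULL);
	SKETCH_WARN_ON_ONCE(index >= dev->num_tx_queues);
	return &dev->tx[index];
}
```

With the switch turned off the check disappears entirely, which is the
trade-off the message describes: diagnostics in debug builds, no extra
per-packet branch in production builds.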

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
Link: https://lore.kernel.org/r/20230321150725.127229-2-nnac123@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-22 22:38:25 -07:00
Jakub Kicinski
1118aa4c70 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
net/wireless/nl80211.c
  b27f07c50a ("wifi: nl80211: fix puncturing bitmap policy")
  cbbaf2bb82 ("wifi: nl80211: add a command to enable/disable HW timestamping")
https://lore.kernel.org/all/20230314105421.3608efae@canb.auug.org.au

tools/testing/selftests/net/Makefile
  62199e3f16 ("selftests: net: Add VXLAN MDB test")
  13715acf8a ("selftest: Add test for bind() conflicts.")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-17 16:29:25 -07:00
Ido Schimmel
8c44fa12c8 net: Add MDB net device operations
Add MDB net device operations that will be invoked by rtnetlink code in
response to received RTM_{NEW,DEL,GET}MDB messages. Subsequent patches
will implement these operations in the bridge and VXLAN drivers.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-17 08:05:48 +00:00
Eric Dumazet
4b397c06cb net: tunnels: annotate lockless accesses to dev->needed_headroom
IP tunnels can apparently update dev->needed_headroom
in their xmit path.

This patch takes care of three tunnels xmit, and also the
core LL_RESERVED_SPACE() and LL_RESERVED_SPACE_EXTRA()
helpers.

More changes might be needed for completeness.

BUG: KCSAN: data-race in ip_tunnel_xmit / ip_tunnel_xmit

read to 0xffff88815b9da0ec of 2 bytes by task 888 on cpu 1:
ip_tunnel_xmit+0x1270/0x1730 net/ipv4/ip_tunnel.c:803
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip_finish_output2+0x740/0x840 net/ipv4/ip_output.c:228
ip_finish_output+0xf4/0x240 net/ipv4/ip_output.c:316
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip_output+0xe5/0x1b0 net/ipv4/ip_output.c:430
dst_output include/net/dst.h:444 [inline]
ip_local_out+0x64/0x80 net/ipv4/ip_output.c:126
iptunnel_xmit+0x34a/0x4b0 net/ipv4/ip_tunnel_core.c:82
ip_tunnel_xmit+0x1451/0x1730 net/ipv4/ip_tunnel.c:813
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246

write to 0xffff88815b9da0ec of 2 bytes by task 2379 on cpu 0:
ip_tunnel_xmit+0x1294/0x1730 net/ipv4/ip_tunnel.c:804
__gre_xmit net/ipv4/ip_gre.c:469 [inline]
ipgre_xmit+0x516/0x570 net/ipv4/ip_gre.c:661
__netdev_start_xmit include/linux/netdevice.h:4881 [inline]
netdev_start_xmit include/linux/netdevice.h:4895 [inline]
xmit_one net/core/dev.c:3580 [inline]
dev_hard_start_xmit+0x127/0x400 net/core/dev.c:3596
__dev_queue_xmit+0x1007/0x1eb0 net/core/dev.c:4246
dev_queue_xmit include/linux/netdevice.h:3051 [inline]
neigh_direct_output+0x17/0x20 net/core/neighbour.c:1623
neigh_output include/net/neighbour.h:546 [inline]
ip6_finish_output2+0x9bc/0xc50 net/ipv6/ip6_output.c:134
__ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
ip6_finish_output+0x39a/0x4e0 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:291 [inline]
ip6_output+0xeb/0x220 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:444 [inline]
NF_HOOK include/linux/netfilter.h:302 [inline]
mld_sendpack+0x438/0x6a0 net/ipv6/mcast.c:1820
mld_send_cr net/ipv6/mcast.c:2121 [inline]
mld_ifc_work+0x519/0x7b0 net/ipv6/mcast.c:2653
process_one_work+0x3e6/0x750 kernel/workqueue.c:2390
worker_thread+0x5f2/0xa10 kernel/workqueue.c:2537
kthread+0x1ac/0x1e0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

value changed: 0x0dd4 -> 0x0e14

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 2379 Comm: kworker/0:0 Not tainted 6.3.0-rc1-syzkaller-00002-g8ca09d5fa354-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/02/2023
Workqueue: mld mld_ifc_work

Fixes: 8eb30be035 ("ipv6: Create ip6_tnl_xmit")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230310191109.2384387-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-15 00:04:04 -07:00
Eric Dumazet
40bbae583e net: remove enum skb_free_reason
enum skb_drop_reason is more generic, we can adopt it instead.

Provide dev_kfree_skb_irq_reason() and dev_kfree_skb_any_reason().

This means drivers can use more precise drop reasons if they want to.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
Link: https://lore.kernel.org/r/20230306204313.10492-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-03-07 23:57:19 -08:00
Paolo Abeni
50bcfe8df7 net: make default_rps_mask a per netns attribute
That really was meant to be a per netns attribute from the beginning.

The idea is that once proper isolation is in place in the main
namespace, additional demux in the child namespaces will be redundant.
Let's make the child netns default rps mask empty.

To avoid bloating the netns with a possibly large cpumask, allocate
it on-demand during the first write operation.
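
The on-demand allocation can be sketched as follows: a namespace carries
only a pointer, a NULL pointer reads as an empty mask, and the bitmap is
allocated on the first write. This is a user-space model; struct
netns_sketch, MASK_BITS and the ns_*() helpers are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define MASK_BITS     256u /* hypothetical number of possible CPUs */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS    ((MASK_BITS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct netns_sketch {
	unsigned long *rps_default_mask; /* NULL until the first write */
};

/* Reading never allocates: a missing mask is an empty mask. */
static int ns_cpu_in_mask(const struct netns_sketch *ns, unsigned int cpu)
{
	assert(ns != NULL);

	if (ns->rps_default_mask == NULL || cpu >= MASK_BITS)
		return 0;
	return ((ns->rps_default_mask[cpu / BITS_PER_WORD] >>
		 (cpu % BITS_PER_WORD)) & 1UL) != 0;
}

/* First write allocates the zeroed bitmap. Returns 0 on success, -1 on
 * a bad CPU number or allocation failure. */
static int ns_set_cpu(struct netns_sketch *ns, unsigned int cpu)
{
	assert(ns != NULL);

	if (cpu >= MASK_BITS)
		return -1;
	if (ns->rps_default_mask == NULL) {
		ns->rps_default_mask = calloc(MASK_WORDS, sizeof(unsigned long));
		if (ns->rps_default_mask == NULL)
			return -1;
	}
	ns->rps_default_mask[cpu / BITS_PER_WORD] |= 1UL << (cpu % BITS_PER_WORD);
	return 0;
}

/* Namespace teardown releases the mask if it was ever allocated. */
static void ns_exit(struct netns_sketch *ns)
{
	assert(ns != NULL);
	free(ns->rps_default_mask);
	ns->rps_default_mask = NULL;
}
```

Namespaces that never write the setting pay only for one pointer, which
is the point of deferring the allocation.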

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-20 11:22:54 +00:00
David S. Miller
675f176b4d Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Some of the devlink bits were tricky, but I think I got it right.

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-17 11:06:39 +00:00
Ido Schimmel
b20b8aec6f devlink: Fix netdev notifier chain corruption
Cited commit changed devlink to register its netdev notifier block on
the global netdev notifier chain instead of on the per network namespace
one.

However, when changing the network namespace of the devlink instance,
devlink still tries to unregister its notifier block from the chain of
the old namespace and register it on the chain of the new namespace.
This results in corruption of the notifier chains, as the same notifier
block is registered on two different chains: The global one and the per
network namespace one. In turn, this causes other problems such as the
inability to dismantle namespaces due to netdev reference count issues.
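
Why double registration corrupts the chains can be seen in a toy model. The sketch below assumes a chain is a singly linked list whose entries each carry one next pointer; toy_nb, toy_chain and the helpers are hypothetical names, and insertion at the head is a simplification (it ignores any ordering a real chain applies).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy notifier block: one 'next' pointer, so it fits on one chain only. */
struct toy_nb {
	struct toy_nb *next;
	int id;
};

struct toy_chain {
	struct toy_nb *head;
};

/* Register at the head of the chain (simplified, no ordering). */
void toy_register(struct toy_chain *chain, struct toy_nb *nb)
{
	nb->next = chain->head; /* overwrites any previous chain linkage */
	chain->head = nb;
}

/* Walk the chain and report whether 'nb' is reachable from its head. */
bool toy_chain_contains(const struct toy_chain *chain,
			const struct toy_nb *nb)
{
	const struct toy_nb *cur;

	for (cur = chain->head; cur; cur = cur->next)
		if (cur == nb)
			return true;
	return false;
}
```

If block D is on a global chain (D -> A) and is then also registered on a per-namespace chain that already holds B, D's next pointer is rewritten to B. Walking the global chain now yields D -> B: entry A is silently lost and a per-namespace entry leaks into the global chain.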

Fix by preventing devlink from moving its notifier block between
namespaces.

Reproducer:

 # echo "10 1" > /sys/bus/netdevsim/new_device
 # ip netns add test123
 # devlink dev reload netdevsim/netdevsim10 netns test123
 # ip netns del test123
 [   71.935619] unregister_netdevice: waiting for lo to become free. Usage count = 2
 [   71.938348] leaked reference.

Fixes: 565b4824c3 ("devlink: change port event netdev notifier from per-net to global")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230215073139.1360108-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-16 11:53:47 +01:00
Jakub Kicinski
de42873367 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY+bZrwAKCRDbK58LschI
 gzi4AP4+TYo0jnSwwkrOoN9l4f5VO9X8osmj3CXfHBv7BGWVxAD/WnvA3TDZyaUd
 agIZTkRs6BHF9He8oROypARZxTeMLwM=
 =nO1C
 -----END PGP SIGNATURE-----

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-02-11

We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).

There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit 5b246e533d ("ice: split probe into smaller functions")
from the net-next tree and commit 66c0e13ad2 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk, since ice_cfg_netdev()
would otherwise appear a 2nd time, and add the XDP features to the existing
ice_cfg_netdev() one:

        [...]
        ice_set_netdev_features(netdev);
        netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
                               NETDEV_XDP_ACT_XSK_ZEROCOPY;
        ice_set_ops(netdev);
        [...]

Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/

The main changes are:

1) Add support for BPF trampoline on s390x, which finally allows removing many
   test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.

2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.

3) Add capability to export the XDP features supported by the NIC.
   Along with that, add an XDP compliance test tool,
   from Lorenzo Bianconi & Marek Majtyka.

4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
   from David Vernet.

5) Add deep-dive documentation on the verifier's register
   liveness tracking algorithm, from Eduard Zingerman.

6) Fix and follow-up cleanups for resolve_btfids to be compiled
   as a host program to avoid cross compile issues,
   from Jiri Olsa & Ian Rogers.

7) Batch of fixes to the BPF selftest for xdp_hw_metadata that resulted
   from testing on different NICs, from Jesper Dangaard Brouer.

8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.

9) Extend libbpf to add an option for when the perf buffer should
   wake up, from Jon Doron.

10) Follow-up fix on xdp_metadata selftest to just consume on TX
    completion, from Stanislav Fomichev.

11) Extend the kfuncs.rst document with a description of kfunc
    lifecycle & stability expectations, from David Vernet.

12) Fix bpftool prog profile to skip attaching to offline CPUs,
    from Tonghao Zhang.

====================

Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-10 17:51:27 -08:00
Paolo Abeni
605cfa1b10 net: introduce default_rps_mask netns attribute
If RPS is enabled, this allows configuring a default RPS
mask, which takes effect from receive queue creation time.

A default RPS mask allows the system admin to ensure proper
isolation, avoiding races at network namespace or device
creation time.

The default RPS mask is initially empty, and can be
modified via a newly added sysctl entry.
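
A minimal sketch of the "effective since creation time" behaviour, assuming a single-word mask for brevity. The names toy_rps_cfg, toy_rxq and toy_rxq_create are hypothetical and not taken from the patch.

```c
/* Toy configuration holding the admin-set default mask (0 == empty). */
struct toy_rps_cfg {
	unsigned long default_mask;
};

/* Toy receive queue with its own RPS mask. */
struct toy_rxq {
	unsigned long rps_mask;
};

/*
 * Queue creation: the queue starts with the default that is current
 * at creation time, so there is no window in which it runs with an
 * unintended mask.
 */
void toy_rxq_create(struct toy_rxq *q, const struct toy_rps_cfg *cfg)
{
	q->rps_mask = cfg->default_mask;
}
```

In this model a queue created before the default is changed keeps the mask it was created with; only queues created afterwards pick up the new value.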

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09 17:45:55 -08:00