Small cycle this time:
- Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw, mana
- inline CQE support for hns
- Have mlx5 display device error codes
- Pinned DMABUF support for irdma
- Continued rxe cleanups, particularly converting the MRs to use xarray
- Improvements to what can be cached in the mlx5 mkey cache
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY/gPmgAKCRCFwuHvBreF
YW5IAP4xOAiTif4f87vD1twRU/ebq4VEX0r+C2NX5x5fwlCJrAEA7RLV8uG9Uii2
ez0BuWNxfajuvFHntnZ1E+7UDP0S8gk=
=CgUH
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Quite a small cycle this time, even with the rc8. I suppose everyone
went to sleep over xmas.
- Minor driver updates for hfi1, cxgb4, erdma, hns, irdma, mlx5, siw,
mana
- inline CQE support for hns
- Have mlx5 display device error codes
- Pinned DMABUF support for irdma
- Continued rxe cleanups, particularly converting the MRs to use
xarray
- Improvements to what can be cached in the mlx5 mkey cache"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (61 commits)
IB/mlx5: Extend debug control for CC parameters
IB/hfi1: Fix sdma.h tx->num_descs off-by-one errors
IB/hfi1: Fix math bugs in hfi1_can_pin_pages()
RDMA/irdma: Add support for dmabuf pin memory regions
RDMA/mlx5: Use query_special_contexts for mkeys
net/mlx5e: Use query_special_contexts for mkeys
net/mlx5: Change define name for 0x100 lkey value
net/mlx5: Expose bits for querying special mkeys
RDMA/rxe: Fix missing memory barriers in rxe_queue.h
RDMA/mana_ib: Fix a bug when the PF indicates more entries for registering memory on first packet
RDMA/rxe: Remove rxe_alloc()
RDMA/cma: Distinguish between sockaddr_in and sockaddr_in6 by size
Subject: RDMA/rxe: Handle zero length rdma
iw_cxgb4: Fix potential NULL dereference in c4iw_fill_res_cm_id_entry()
RDMA/mlx5: Use rdma_umem_for_each_dma_block()
RDMA/umem: Remove unused 'work' member from struct ib_umem
RDMA/irdma: Cap MSIX used to online CPUs + 1
RDMA/mlx5: Check reg_create() create for errors
RDMA/restrack: Correct spelling
RDMA/cxgb4: Fix potential null-ptr-deref in pass_establish()
...
Clang can do some aggressive inlining, which provides it with greater
visibility into the sizes of various objects that are passed into
helpers. Specifically, compare_netdev_and_ip() can see through the type
given to the "sa" argument, which means it can generate code for "struct
sockaddr_in" that would have been passed to ipv6_addr_cmp() (that expects
to operate on the larger "struct sockaddr_in6"), which would result in a
compile-time buffer overflow condition detected by memcmp(). Logically,
this state isn't reachable due to the sa_family assignment two callers
above and the check in compare_netdev_and_ip(). Instead, provide a
compile-time check on sizes so the size-mismatched code will be elided
when inlining. Avoids the following warning from Clang:
../include/linux/fortify-string.h:652:4: error: call to '__read_overflow' declared with 'error' attribute: detected read beyond size of object (1st parameter)
__read_overflow();
^
note: In function 'cma_netevent_callback'
note: which inlined function 'node_from_ndev_ip'
1 error generated.
When the underlying object size is not known (e.g. with GCC and older
Clang), the result of __builtin_object_size() is SIZE_MAX, which will also
compile away, leaving the code as it was originally.
Link: https://lore.kernel.org/r/20230208232549.never.139-kees@kernel.org
Link: https://github.com/ClangBuiltLinux/linux/issues/1687
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org> # build
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The cited commit moves umem to call the unlocked versions of dmabuf
unmap/map attachment, but the lock is held while calling to these
functions, hence move back to the locked versions of these APIs.
Fixes: 21c9c5c078 ("RDMA/umem: Prepare to dynamic dma-buf locking specification")
Link: https://lore.kernel.org/r/311c2cb791f8af75486df446819071357353db1b.1675088709.git.leon@kernel.org
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When registering a new DMA MR after selecting the best aligned page size
for it, we iterate over the given sglist to split each entry to smaller,
aligned to the selected page size, DMA blocks.
In given circumstances where the sg entry and page size fit certain
sizes and the sg entry is not aligned to the selected page size, the
total size of the aligned pages we need to cover the sg entry is >= 4GB.
Under this circumstances, while iterating page aligned blocks, the
counter responsible for counting how much we advanced from the start of
the sg entry is overflowed because its type is u32 and we pass 4GB in
size. This can lead to an infinite loop inside the iterator function
because the overflow prevents the counter to be larger
than the size of the sg entry.
Fix the presented problem by changing the advancement condition to
eliminate overflow.
Backtrace:
[ 192.374329] efa_reg_user_mr_dmabuf
[ 192.376783] efa_register_mr
[ 192.382579] pgsz_bitmap 0xfffff000 rounddown 0x80000000
[ 192.386423] pg_sz [0x80000000] umem_length[0xc0000000]
[ 192.392657] start 0x0 length 0xc0000000 params.page_shift 31 params.page_num 3
[ 192.399559] hp_cnt[3], pages_in_hp[524288]
[ 192.403690] umem->sgt_append.sgt.nents[1]
[ 192.407905] number entries: [1], pg_bit: [31]
[ 192.411397] biter->__sg_nents [1] biter->__sg [0000000008b0c5d8]
[ 192.415601] biter->__sg_advance [665837568] sg_dma_len[3221225472]
[ 192.419823] biter->__sg_nents [1] biter->__sg [0000000008b0c5d8]
[ 192.423976] biter->__sg_advance [2813321216] sg_dma_len[3221225472]
[ 192.428243] biter->__sg_nents [1] biter->__sg [0000000008b0c5d8]
[ 192.432397] biter->__sg_advance [665837568] sg_dma_len[3221225472]
Fixes: a808273a49 ("RDMA/verbs: Add a DMA iterator to return aligned contiguous memory blocks")
Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Link: https://lore.kernel.org/r/20230109133711.13678-1-ynachum@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Refactors based on comments [1] of the multiple path records support
patchset:
- Return failure if not able to set inbound/outbound PRs;
- Simplify the flow when receiving the PRs from netlink channel: When
a good PR response is received, unpack it and call the path_query
callback directly. This saves two memory allocations;
- Define RDMA_PRIMARY_PATH_MAX_REC_NUM in a proper place.
[1] https://lore.kernel.org/linux-rdma/Yyxp9E9pJtUids2o@nvidia.com/
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org> #srp
Link: https://lore.kernel.org/r/7610025d57342b8b6da0f19516c9612f9c3fdc37.1672819376.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Refactor rdma_bind_addr function so that it doesn't require that the
cma destination address be changed before calling it.
So now it will update the destination address internally only when it is
really needed and after passing all the required checks.
Which in turn results in a cleaner and more sensible call and error
handling flows for the functions that call it directly or indirectly.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reported-by: Wei Chen <harperchen1110@gmail.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/3d0e9a2fd62bc10ba02fed1c7c48a48638952320.1672819273.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass in
a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be used
no matter what the const value is. This prevents every subsystem from
having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject, objects
as being "non-mutable". The changes to the kobject and driver core in
this pull request are the result of that, as there are lots of paths
where kobjects and device pointers are not modified at all, so marking
them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object rules.
All of this has been bike-shedded in quite a lot of detail on lkml with
different names and implementations resulting in the tiny version we
have in here, much better than my original proposal. Lots of subsystem
maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with no
problems, OTHER than some merge issues with other trees that should be
obvious when you hit them (block tree deletes a driver that this tree
modifies, iommufd tree modifies code that this tree also touches). If
there are merge problems with these trees, please let me know.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY5wz3A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yks0ACeKYUlVgCsER8eYW+x18szFa2QTXgAn2h/VhZe
1Fp53boFaQkGBjl8mGF8
=v+FB
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass
in a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be
used no matter what the const value is. This prevents every subsystem
from having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject,
objects as being "non-mutable". The changes to the kobject and driver
core in this pull request are the result of that, as there are lots of
paths where kobjects and device pointers are not modified at all, so
marking them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object
rules.
All of this has been bike-shedded in quite a lot of detail on lkml
with different names and implementations resulting in the tiny version
we have in here, much better than my original proposal. Lots of
subsystem maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with
no problems"
* tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits)
device property: Fix documentation for fwnode_get_next_parent()
firmware_loader: fix up to_fw_sysfs() to preserve const
usb.h: take advantage of container_of_const()
device.h: move kobj_to_dev() to use container_of_const()
container_of: add container_of_const() that preserves const-ness of the pointer
driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion.
driver core: fix up missed scsi/cxlflash class.devnode() conversion.
driver core: fix up some missing class.devnode() conversions.
driver core: make struct class.devnode() take a const *
driver core: make struct class.dev_uevent() take a const *
cacheinfo: Remove of_node_put() for fw_token
device property: Add a blank line in Kconfig of tests
device property: Rename goto label to be more precise
device property: Move PROPERTY_ENTRY_BOOL() a bit down
device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*()
kernfs: fix all kernel-doc warnings and multiple typos
driver core: pass a const * into of_device_uevent()
kobject: kset_uevent_ops: make name() callback take a const *
kobject: kset_uevent_ops: make filter() callback take a const *
kobject: make kobject_namespace take a const *
...
Usual size of updates, a new driver a most of the bulk focusing on rxe:
- Usual typos, style, and language updates
- Driver updates for mlx5, irdma, siw, rts, srp, hfi1, hns, erdma, mlx4, srp
- Lots of RXE updates
* Improve reply error handling for bad MR operations
* Code tidying
* Debug printing uses common loggers
* Remove half implemented RD related stuff
* Support IBA's recently defined Atomic Write and Flush operations
- erdma support for atomic operations
- New driver "mana" for Ethernet HW available in Azure VMs. This driver
only supports DPDK
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCY5eIggAKCRCFwuHvBreF
YeX7AP9+l5Y9J48OmK7y/YgADNo9g05agXp3E8EuUDmBU+PREgEAigdWaJVf2oea
IctVja0ApLW5W+wsFt8Qh+V4PMiYTAM=
=Q5V+
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Usual size of updates, a new driver, and most of the bulk focusing on
rxe:
- Usual typos, style, and language updates
- Driver updates for mlx5, irdma, siw, rts, srp, hfi1, hns, erdma,
mlx4, srp
- Lots of RXE updates:
* Improve reply error handling for bad MR operations
* Code tidying
* Debug printing uses common loggers
* Remove half implemented RD related stuff
* Support IBA's recently defined Atomic Write and Flush operations
- erdma support for atomic operations
- New driver 'mana' for Ethernet HW available in Azure VMs. This
driver only supports DPDK"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (122 commits)
IB/IPoIB: Fix queue count inconsistency for PKEY child interfaces
RDMA: Add missed netdev_put() for the netdevice_tracker
RDMA/rxe: Enable RDMA FLUSH capability for rxe device
RDMA/cm: Make QP FLUSHABLE for supported device
RDMA/rxe: Implement flush completion
RDMA/rxe: Implement flush execution in responder side
RDMA/rxe: Implement RC RDMA FLUSH service in requester side
RDMA/rxe: Extend rxe packet format to support flush
RDMA/rxe: Allow registering persistent flag for pmem MR only
RDMA/rxe: Extend rxe user ABI to support flush
RDMA: Extend RDMA kernel verbs ABI to support flush
RDMA: Extend RDMA user ABI to support flush
RDMA/rxe: Fix incorrect responder length checking
RDMA/rxe: Fix oops with zero length reads
RDMA/mlx5: Remove not-used IB_FLOW_SPEC_IB define
RDMA/hns: Fix XRC caps on HIP08
RDMA/hns: Fix error code of CMD
RDMA/hns: Fix page size cap from firmware
RDMA/hns: Fix PBL page MTR find
RDMA/hns: Fix AH attr queried by query_qp
...
- More userfaultfs work from Peter Xu.
- Several convert-to-folios series from Sidhartha Kumar and Huang Ying.
- Some filemap cleanups from Vishal Moola.
- David Hildenbrand added the ability to selftest anon memory COW handling.
- Some cpuset simplifications from Liu Shixin.
- Addition of vmalloc tracing support by Uladzislau Rezki.
- Some pagecache folioifications and simplifications from Matthew Wilcox.
- A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it.
- Miguel Ojeda contributed some cleanups for our use of the
__no_sanitize_thread__ gcc keyword. This series shold have been in the
non-MM tree, my bad.
- Naoya Horiguchi improved the interaction between memory poisoning and
memory section removal for huge pages.
- DAMON cleanups and tuneups from SeongJae Park
- Tony Luck fixed the handling of COW faults against poisoned pages.
- Peter Xu utilized the PTE marker code for handling swapin errors.
- Hugh Dickins reworked compound page mapcount handling, simplifying it
and making it more efficient.
- Removal of the autonuma savedwrite infrastructure from Nadav Amit and
David Hildenbrand.
- zram support for multiple compression streams from Sergey Senozhatsky.
- David Hildenbrand reworked the GUP code's R/O long-term pinning so
that drivers no longer need to use the FOLL_FORCE workaround which
didn't work very well anyway.
- Mel Gorman altered the page allocator so that local IRQs can remnain
enabled during per-cpu page allocations.
- Vishal Moola removed the try_to_release_page() wrapper.
- Stefan Roesch added some per-BDI sysfs tunables which are used to
prevent network block devices from dirtying excessive amounts of
pagecache.
- David Hildenbrand did some cleanup and repair work on KSM COW
breaking.
- Nhat Pham and Johannes Weiner have implemented writeback in zswap's
zsmalloc backend.
- Brian Foster has fixed a longstanding corner-case oddity in
file[map]_write_and_wait_range().
- sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
Chen.
- Shiyang Ruan has done some work on fsdax, to make its reflink mode
work better under xfstests. Better, but still not perfect.
- Christoph Hellwig has removed the .writepage() method from several
filesystems. They only need .writepages().
- Yosry Ahmed wrote a series which fixes the memcg reclaim target
beancounting.
- David Hildenbrand has fixed some of our MM selftests for 32-bit
machines.
- Many singleton patches, as usual.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY5j6ZwAKCRDdBJ7gKXxA
jkDYAP9qNeVqp9iuHjZNTqzMXkfmJPsw2kmy2P+VdzYVuQRcJgEAgoV9d7oMq4ml
CodAgiA51qwzId3GRytIo/tfWZSezgA=
=d19R
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- More userfaultfs work from Peter Xu
- Several convert-to-folios series from Sidhartha Kumar and Huang Ying
- Some filemap cleanups from Vishal Moola
- David Hildenbrand added the ability to selftest anon memory COW
handling
- Some cpuset simplifications from Liu Shixin
- Addition of vmalloc tracing support by Uladzislau Rezki
- Some pagecache folioifications and simplifications from Matthew
Wilcox
- A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use
it
- Miguel Ojeda contributed some cleanups for our use of the
__no_sanitize_thread__ gcc keyword.
This series should have been in the non-MM tree, my bad
- Naoya Horiguchi improved the interaction between memory poisoning and
memory section removal for huge pages
- DAMON cleanups and tuneups from SeongJae Park
- Tony Luck fixed the handling of COW faults against poisoned pages
- Peter Xu utilized the PTE marker code for handling swapin errors
- Hugh Dickins reworked compound page mapcount handling, simplifying it
and making it more efficient
- Removal of the autonuma savedwrite infrastructure from Nadav Amit and
David Hildenbrand
- zram support for multiple compression streams from Sergey Senozhatsky
- David Hildenbrand reworked the GUP code's R/O long-term pinning so
that drivers no longer need to use the FOLL_FORCE workaround which
didn't work very well anyway
- Mel Gorman altered the page allocator so that local IRQs can remnain
enabled during per-cpu page allocations
- Vishal Moola removed the try_to_release_page() wrapper
- Stefan Roesch added some per-BDI sysfs tunables which are used to
prevent network block devices from dirtying excessive amounts of
pagecache
- David Hildenbrand did some cleanup and repair work on KSM COW
breaking
- Nhat Pham and Johannes Weiner have implemented writeback in zswap's
zsmalloc backend
- Brian Foster has fixed a longstanding corner-case oddity in
file[map]_write_and_wait_range()
- sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang
Chen
- Shiyang Ruan has done some work on fsdax, to make its reflink mode
work better under xfstests. Better, but still not perfect
- Christoph Hellwig has removed the .writepage() method from several
filesystems. They only need .writepages()
- Yosry Ahmed wrote a series which fixes the memcg reclaim target
beancounting
- David Hildenbrand has fixed some of our MM selftests for 32-bit
machines
- Many singleton patches, as usual
* tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
mm: mmu_gather: allow more than one batch of delayed rmaps
mm: fix typo in struct pglist_data code comment
kmsan: fix memcpy tests
mm: add cond_resched() in swapin_walk_pmd_entry()
mm: do not show fs mm pc for VM_LOCKONFAULT pages
selftests/vm: ksm_functional_tests: fixes for 32bit
selftests/vm: cow: fix compile warning on 32bit
selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
mm,thp,rmap: fix races between updates of subpages_mapcount
mm: memcg: fix swapcached stat accounting
mm: add nodes= arg to memory.reclaim
mm: disable top-tier fallback to reclaim on proactive reclaim
selftests: cgroup: make sure reclaim target memcg is unprotected
selftests: cgroup: refactor proactive reclaim code to reclaim_until()
mm: memcg: fix stale protection of reclaim target memcg
mm/mmap: properly unaccount memory on mas_preallocate() failure
omfs: remove ->writepage
jfs: remove ->writepage
...
Initial accel subsystem support. There are no drivers yet, just the framework.
New driver:
- ofdrm - replacement for offb
fbdev:
- add support for nomodeset
fourcc:
- add Vivante tiled modifier
core:
- atomic-helpers: CRTC primary plane test fixes, fb access hooks
- connector: TV API consistency, cmdline parser improvements
- send connector hotplug on cleanup
- sort makefile objects
tests:
- sort kunit tests
- improve DP-MST tests
- add kunit helpers to create a device
sched:
- module param for scheduling policy
- refcounting fix
buddy:
- add back random seed log
ttm:
- convert ttm_resource to size_t
- optimize pool allocations
edid:
- HFVSDB parsing support fixes
- logging/debug improvements
- DSC quirks
dma-buf:
- Add unlocked vmap and attachment mapping
- move drivers to common locking convention
- locking improvements
firmware:
- new API for rPI firmware and vc4
xilinx:
- zynqmp: displayport bridge support
- dpsub fix
bridge:
- adv7533: Remove dynamic lane switching
- it6505: Runtime PM support, sync improvements
- ps8640: Handle AUX defer messages
- tc358775: Drop soft-reset over I2C
panel:
- panel-edp: Add INX N116BGE-EA2 C2 and C4 support.
- Jadard JD9365DA-H3
- NewVision NV3051D
amdgpu:
- DCN support on ARM
- DCN 2.1 secure display
- Sienna Cichlid mode2 reset fixes
- new GC 11.x firmware versions
- drop AMD specific DSC workarounds in favour of drm code
- clang warning fixes
- scheduler rework
- SR-IOV fixes
- GPUVM locking fixes
- fix memory leak in CS IOCTL error path
- flexible array updates
- enable new GC/PSP/SMU/NBIO IP
- GFX preemption support for gfx9
amdkfd:
- cache size fixes
- userptr fixes
- enable cooperative launch on gfx 10.3
- enable GC 11.0.4 KFD support
radeon:
- replace kmap with kmap_local_page
- ACPI ref count fix
- HDA audio notifier support
i915:
- DG2 enabled by default
- MTL enablement work
- hotplug refactoring
- VBT improvements
- Display and watermark refactoring
- ADL-P workaround
- temp disable runtime_pm for discrete-
- fix for A380 as a secondary GPU
- Wa_18017747507 for DG2
- CS timestamp support fixes for gen5 and earlier
- never purge busy TTM objects
- use i915_sg_dma_sizes for all backends
- demote GuC kernel contexts to normal priority
- gvt: refactor for new MDEV interface
- enable DC power states on eDP ports
- fix gen 2/3 workarounds
nouveau:
- fix page fault handling
- Ampere acceleration support
- driver stability improvements
- nva3 backlight support
msm:
- MSM_INFO_GET_FLAGS support
- DPU: XR30 and P010 image formats
- Qualcomm SM6115 support
- DSI PHY support for QCM2290
- HDMI: refactored dev init path
- remove exclusive-fence hack
- fix speed-bin detection
- enable clamp to idle on 7c3
- improved hangcheck detection
vmwgfx:
- fb and cursor refactoring
- convert to generic hashtable
- cursor improvements
etnaviv:
- hw workarounds
- softpin MMU fixes
ast:
- atomic gamma LUT support
- convert to SHMEM
lcdif:
- support YUV planes
- Increase DMA burst size
- FIFO threshold tuning
meson:
- fix return type of cvbs mode_valid
mgag200:
- fix PLL setup on some revisions
sun4i:
- A100 and D1 support
udl:
- modesetting improvements
- hot unplug support
vc4:
- support PAL-M
- fix regression preventing 4K @ 60Hz
- fix NULL ptr deref
v3d:
- switch to drm managed resources
renesas:
- RZ/G2L DSI support
- DU Kconfig cleanup
mediatek:
- fixup dpi and hdmi
- MT8188 dpi support
- MT8195 AFBC support
tegra:
- NVDEC hardware on Tegra234 SoC
hdlcd:
- switch to drm managed resources
ingenic:
- fix registration error path
hisilicon:
- convert to drm_mode_init
maildp:
- use managed resources
mtk:
- use drm_mode_init
rockchip:
- use drm_mode_copy
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmOXxI0ACgkQDHTzWXnE
hr4NyBAAojK3N+XJf2b8LWuRKsShCr5FXlteEDxiYGLeB8/g4x3LztSfHgUg0iuS
nP1m7Cx4snXcVNS6iyOsoZVq1EGUAWvv+mPWJe1UywjpyqtciTVQ11GEHRvI/w+V
GRvkhmt/TsoZA0QIlS2MaOmhn9j17QOcuYTUjYdyRL4tsrHWrTASH5W1Jt2xmDyw
5FUJvfukPWm100DVWbh6hWbCKL22bDDF/nj1H+G6hYSyTjVbk7wZ0vy2m6TnIHNF
iyBHBIzFPg3BveiSlKe6aVX7Gq2d8bfqjHsgN5f1qcS4ejWEkHLVxJtBdOB+fOSC
7o8Ms7WHi1AmnkOVCGRIjJ0cJrLZu2HDlyhViguAO1XQ3Jvuo/4WW3mplv+YPOMc
c+P/zuPG42d4lrISuB8wspTdOgxmqpZDkg3HE6n1+jiVR0u4hTTYktoPnLsHX6KG
l/l2B6aVAxE4b6P0q3ofYoAnk5rNsb1YUS+a8kC6f97TQ3gmOsN75iZXD/ASHg2r
ozhh2wcFxIPkJhE7vqLWPIBCWQs93sGyQXoI7Q0TJaIAZTXV0VmO1BIofetpVImE
7FhDC4wvBedXywN8NYUEFbCTOnIcDMteM/i6S1ns78s5UjDa5osPuS5I02VT1lbN
tvnJoHNkhCt13lJz63b0HNFm3cPKoRosCQhJeshyUYaFKs+evL0=
=pABG
-----END PGP SIGNATURE-----
Merge tag 'drm-next-2022-12-13' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"The biggest highlight is that the accel subsystem framework is merged.
Hopefully for 6.3 we will be able to line up a driver to use it.
In drivers land, i915 enables DG2 support by default now, and nouveau
has a big stability refactoring and initial ampere support, AMD
includes new hw IP support and should build on ARM again. There is
also an ofdrm driver to take over offb on platforms it's used.
Stuff outside my tree, the dma-buf patches hit a few places, the vc4
firmware changes also do, and i915 has some interactions with MEI for
discrete GPUs. I think all of those should have been acked/reviewed by
relevant parties.
New driver:
- ofdrm - replacement for offb
fbdev:
- add support for nomodeset
fourcc:
- add Vivante tiled modifier
core:
- atomic-helpers: CRTC primary plane test fixes, fb access hooks
- connector: TV API consistency, cmdline parser improvements
- send connector hotplug on cleanup
- sort makefile objects
tests:
- sort kunit tests
- improve DP-MST tests
- add kunit helpers to create a device
sched:
- module param for scheduling policy
- refcounting fix
buddy:
- add back random seed log
ttm:
- convert ttm_resource to size_t
- optimize pool allocations
edid:
- HFVSDB parsing support fixes
- logging/debug improvements
- DSC quirks
dma-buf:
- Add unlocked vmap and attachment mapping
- move drivers to common locking convention
- locking improvements
firmware:
- new API for rPI firmware and vc4
xilinx:
- zynqmp: displayport bridge support
- dpsub fix
bridge:
- adv7533: Remove dynamic lane switching
- it6505: Runtime PM support, sync improvements
- ps8640: Handle AUX defer messages
- tc358775: Drop soft-reset over I2C
panel:
- panel-edp: Add INX N116BGE-EA2 C2 and C4 support.
- Jadard JD9365DA-H3
- NewVision NV3051D
amdgpu:
- DCN support on ARM
- DCN 2.1 secure display
- Sienna Cichlid mode2 reset fixes
- new GC 11.x firmware versions
- drop AMD specific DSC workarounds in favour of drm code
- clang warning fixes
- scheduler rework
- SR-IOV fixes
- GPUVM locking fixes
- fix memory leak in CS IOCTL error path
- flexible array updates
- enable new GC/PSP/SMU/NBIO IP
- GFX preemption support for gfx9
amdkfd:
- cache size fixes
- userptr fixes
- enable cooperative launch on gfx 10.3
- enable GC 11.0.4 KFD support
radeon:
- replace kmap with kmap_local_page
- ACPI ref count fix
- HDA audio notifier support
i915:
- DG2 enabled by default
- MTL enablement work
- hotplug refactoring
- VBT improvements
- Display and watermark refactoring
- ADL-P workaround
- temp disable runtime_pm for discrete-
- fix for A380 as a secondary GPU
- Wa_18017747507 for DG2
- CS timestamp support fixes for gen5 and earlier
- never purge busy TTM objects
- use i915_sg_dma_sizes for all backends
- demote GuC kernel contexts to normal priority
- gvt: refactor for new MDEV interface
- enable DC power states on eDP ports
- fix gen 2/3 workarounds
nouveau:
- fix page fault handling
- Ampere acceleration support
- driver stability improvements
- nva3 backlight support
msm:
- MSM_INFO_GET_FLAGS support
- DPU: XR30 and P010 image formats
- Qualcomm SM6115 support
- DSI PHY support for QCM2290
- HDMI: refactored dev init path
- remove exclusive-fence hack
- fix speed-bin detection
- enable clamp to idle on 7c3
- improved hangcheck detection
vmwgfx:
- fb and cursor refactoring
- convert to generic hashtable
- cursor improvements
etnaviv:
- hw workarounds
- softpin MMU fixes
ast:
- atomic gamma LUT support
- convert to SHMEM
lcdif:
- support YUV planes
- Increase DMA burst size
- FIFO threshold tuning
meson:
- fix return type of cvbs mode_valid
mgag200:
- fix PLL setup on some revisions
sun4i:
- A100 and D1 support
udl:
- modesetting improvements
- hot unplug support
vc4:
- support PAL-M
- fix regression preventing 4K @ 60Hz
- fix NULL ptr deref
v3d:
- switch to drm managed resources
renesas:
- RZ/G2L DSI support
- DU Kconfig cleanup
mediatek:
- fixup dpi and hdmi
- MT8188 dpi support
- MT8195 AFBC support
tegra:
- NVDEC hardware on Tegra234 SoC
hdlcd:
- switch to drm managed resources
ingenic:
- fix registration error path
hisilicon:
- convert to drm_mode_init
maildp:
- use managed resources
mtk:
- use drm_mode_init
rockchip:
- use drm_mode_copy"
* tag 'drm-next-2022-12-13' of git://anongit.freedesktop.org/drm/drm: (1397 commits)
drm/amdgpu: fix mmhub register base coding error
drm/amdgpu: add tmz support for GC IP v11.0.4
drm/amdgpu: enable GFX Clock Gating control for GC IP v11.0.4
drm/amdgpu: enable GFX Power Gating for GC IP v11.0.4
drm/amdgpu: enable GFX IP v11.0.4 CG support
drm/amdgpu: Make amdgpu_ring_mux functions as static
drm/amdgpu: generally allow over-commit during BO allocation
drm/amd/display: fix array index out of bound error in DCN32 DML
drm/amd/display: 3.2.215
drm/amd/display: set optimized required for comp buf changes
drm/amd/display: Add debug option to skip PSR CRTC disable
drm/amd/display: correct DML calc error of UrgentLatency
drm/amd/display: correct static_screen_event_mask
drm/amd/display: Ensure commit_streams returns the DC return code
drm/amd/display: read invalid ddc pin status cause engine busy
drm/amd/display: Bypass DET swath fill check for max clocks
drm/amd/display: Disable uclk pstate for subvp pipes
drm/amd/display: Fix DCN2.1 default DSC clocks
drm/amd/display: Enable dp_hdmi21_pcon support
drm/amd/display: prevent seamless boot on displays that don't have the preferred dig
...
This release introduces support for the CB_RECALL_ANY operation.
NFSD can send this operation to request that clients return any
delegations they choose. The server uses this operation to handle
low memory scenarios or indicate to a client when that client has
reached the maximum number of delegations the server supports.
The NFSv4.2 READ_PLUS operation has been simplified temporarily
whilst support for sparse files in local filesystems and the VFS is
improved.
Two major data structure fixes appear in this release:
* The nfs4_file hash table is replaced with a resizable hash table
to reduce the latency of NFSv4 OPEN operations.
* Reference counting in the NFSD filecache has been hardened against
races.
In furtherance of removing support for NFSv2 in a subsequent kernel
release, a new Kconfig option enables server-side support for NFSv2
to be left out of a kernel build.
MAINTAINERS has been updated to indicate that changes to fs/exportfs
should go through the NFSD tree.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmOXPE4ACgkQM2qzM29m
f5faaRAAh7YT5V61afPbfgBybO5AbDzztpZSNjNjLZs78piSnFp6hP75yNtTviwQ
1o7St13/NkCmDaIdGUpr02U01zbM1BDOq2wGckImOJLNSgb7xHV5r4PqkRiFkh0t
QYSnwG+wp8fDUJeCL/nAOAu9I9EQUqHzWchxiU/h8ln2hN3rXUlIRSeo17Wy7zkD
cNIcoAjTi9fzY3dE6H4r+lZTdNCYH+AdzChmKrHdRZQwq0Xs3FWv4gAMTLbDuD4P
B6NDHz0Umn6XnFsJGptwozkwaWeMQw4GyJj/3iUiO8JF209SaoYXMPjJAyG6tYYa
fUrgv4UXGeXjigDbLBA5IYxfhX7GXjMQSaj3edhzyrl8P74q4/Cq/8fDUnAZ841m
E+TGSCPIQD0QuIjdXxLv9KLY8JNThSfcAt6jr5GBXhPZQr8xpS0BqK/Onr68fgZC
Lpull5xN68L4A1B7cf2GNPuMyvkBKxwSGXOehldh/BkvpVMjFwqd4/q5xWC+6CcQ
tbOkjTbbSS71nzJwZip0NphaYCa3qQPzKT4SZzn/I4I9W5otbwYBx734Bw46gTDE
ZPUXTuJ00VPgX07wbLRahg521Fwzr+8sk1WnVYq82PoaMh1l9FjzLNGouQWBdo3E
UzIo/KUfQKmoZce6O723L6OI4ffdK5oMtfaTpe+SiUPpV1lUAcA=
=jNlu
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"This release introduces support for the CB_RECALL_ANY operation. NFSD
can send this operation to request that clients return any delegations
they choose. The server uses this operation to handle low memory
scenarios or indicate to a client when that client has reached the
maximum number of delegations the server supports.
The NFSv4.2 READ_PLUS operation has been simplified temporarily whilst
support for sparse files in local filesystems and the VFS is improved.
Two major data structure fixes appear in this release:
- The nfs4_file hash table is replaced with a resizable hash table to
reduce the latency of NFSv4 OPEN operations.
- Reference counting in the NFSD filecache has been hardened against
races.
In furtherance of removing support for NFSv2 in a subsequent kernel
release, a new Kconfig option enables server-side support for NFSv2 to
be left out of a kernel build.
MAINTAINERS has been updated to indicate that changes to fs/exportfs
should go through the NFSD tree"
* tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (49 commits)
NFSD: Avoid clashing function prototypes
SUNRPC: Fix crasher in unwrap_integ_data()
SUNRPC: Make the svc_authenticate tracepoint conditional
NFSD: Use only RQ_DROPME to signal the need to drop a reply
SUNRPC: Clean up xdr_write_pages()
SUNRPC: Don't leak netobj memory when gss_read_proxy_verf() fails
NFSD: add CB_RECALL_ANY tracepoints
NFSD: add delegation reaper to react to low memory condition
NFSD: add support for sending CB_RECALL_ANY
NFSD: refactoring courtesy_client_reaper to a generic low memory shrinker
trace: Relocate event helper files
NFSD: pass range end to vfs_fsync_range() instead of count
lockd: fix file selection in nlmsvc_cancel_blocked
lockd: ensure we use the correct file descriptor when unlocking
lockd: set missing fl_flags field when retrieving args
NFSD: Use struct_size() helper in alloc_session()
nfsd: return error if nfs4_setacl fails
lockd: set other missing fields when unlocking files
NFSD: Add an nfsd_file_fsync tracepoint
sunrpc: svc: Remove an unused static function svc_ungetu32()
...
Steven Rostedt says:
> The include/trace/events/ directory should only hold files that
> are to create events, not headers that hold helper functions.
>
> Can you please move them out of include/trace/events/ as that
> directory is "special" in the creation of events.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Similar to RDMA and Atomic qp attributes enabled by default in CM, enable
FLUSH attribute for supported device. That makes applications that are
built with rdma_create_ep, rdma_accept APIs have FLUSH qp attribute
natively so that user is able to request FLUSH operation simpler.
Note that, a FLUSH operation requires FLUSH are supported by both
device(HCA) and memory region(MR) and QP at the same time, so it's safe
to enable FLUSH qp attribute by default here.
FLUSH attribute can be disable by modify_qp() interface.
Link: https://lore.kernel.org/r/20221206130201.30986-10-lizhijian@fujitsu.com
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmONI6weHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG9xgH/jqXGuMoO1ikfmGb
7oY0W/f69G9V/e0DxFLvnIjhFgCUzdnNsmD4jQJA4x6QsxwLWuvpI282Ez+bHV5T
U4RPsxJZIIMsXE2lKM9BRgeLzDdCt0aK4Pj+3x2x7NZC5cWFSQ8PyQJkCwg+0PQo
u8Ly+GO8c4RUMf4/rrAZQq16qZUqGDaGm1EJhtSoa+KiR81LmUUmbDIK9Mr53rmQ
wou+95XhibwMWr17WgXA28bTgYqn9UGr67V3qvTH2LC7GW8BCoKvn+3wh6TVhlWj
dsWplXgcOP0/OHvSC5Sb1Uibk5Gx3DlIzYa6OfNZQuZ5xmQqm9kXjW8lmYpWFHy/
38/5HWc=
=EuoA
-----END PGP SIGNATURE-----
Merge tag 'v6.1-rc8' into rdma.git for-next
For dependencies in following patches
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The ack timeout retransmission time is affected by the following two
factors: one is packet life time, another is the HCA processing time.
Now the default packet lifetime(CMA_IBOE_PACKET_LIFETIME) is 18.
That means the minimum ack timeout is 2
seconds (2^(18+1)*4us=2.097seconds). The packet lifetime means the
maximum transmission time of packets on the network, 2 seconds is too
long.
Assume the network is a clos topology with three layers, every packet will
pass through five hops of switches. Assume the buffer of every switch is
128MB and the port transmission rate is 25 Gbit/s, the maximum
transmission time of the packet is 200ms(128MB*5/25Gbit/s). Add double
redundancy, it is less than 500ms.
So change the CMA_IBOE_PACKET_LIFETIME to 16, the maximum transmission
time of the packet will be about 500+ms, it is long enough.
Link: https://lore.kernel.org/r/20221125010026.755-1-lengchao@huawei.com
Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
GUP now supports reliable R/O long-term pinning in COW mappings, such
that we break COW early. MAP_SHARED VMAs only use the shared zeropage so
far in one corner case (DAXFS file with holes), which can be ignored
because GUP does not support long-term pinning in fsdax (see
check_vma_flags()).
Consequently, FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM is no longer required
for reliable R/O long-term pinning: FOLL_LONGTERM is sufficient. So stop
using FOLL_FORCE, which is really only for ptrace access.
Link: https://lkml.kernel.org/r/20221116102659.70287-11-david@redhat.com
Tested-by: Leon Romanovsky <leonro@nvidia.com> [over mlx4 and mlx5]
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Return "-EMSGSIZE" instead of "-EINVAL" when filling a QP entry, so that
new SKBs will be allocated if there's not enough room in current SKB.
Fixes: 65959522f8 ("RDMA: Add support to dump resource tracker in RAW format")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://lore.kernel.org/r/b5e9c62f6b8369acab5648b661bf539cbceeffdc.1669636336.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Using nlmsg_put causes static analysis tools to many
false positives of not checking the return value of nlmsg_put.
In all uses in nldev.c, payload parameter is 0 so NULL will never
be returned. So let's add useless checks to silence the warnings.
Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/bd924da89d5b4f5291a4a01d9b5ae47c0a9b6a3f.1669636336.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
As the nla_nest_start() may fail with NULL returned, the return value needs
to be checked.
Fixes: c4ffee7c9b ("RDMA/netlink: Implement counter dumpit calback")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Link: https://lore.kernel.org/r/20221126043410.85632-1-yuancan@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This will cause an informative backtrace to print if the user of
ib_device_set_netdev() isn't careful about tearing down the ibdevice
before its the netdevice parent is destroyed. Such as like this:
unregister_netdevice: waiting for vlan0 to become free. Usage count = 2
leaked reference.
ib_device_set_netdev+0x266/0x730
siw_newlink+0x4e0/0xfd0
nldev_newlink+0x35c/0x5c0
rdma_nl_rcv_msg+0x36d/0x690
rdma_nl_rcv+0x2ee/0x430
netlink_unicast+0x543/0x7f0
netlink_sendmsg+0x918/0xe20
sock_sendmsg+0xcf/0x120
____sys_sendmsg+0x70d/0x8b0
___sys_sendmsg+0x11d/0x1b0
__sys_sendmsg+0xfa/0x1d0
do_syscall_64+0x35/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This will help debug the issues syzkaller is seeing.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-a7c81b3842ce+e5-netdev_tracker_jgg@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The devnode() in struct class should not be modifying the device that is
passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: x86@kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Justin Sanders <justin@coraid.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Cc: Liam Mark <lmark@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Brian Starkey <Brian.Starkey@arm.com>
Cc: John Stultz <jstultz@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sean Young <sean@mess.org>
Cc: Frank Haverkamp <haver@linux.ibm.com>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Xie Yongji <xieyongji@bytedance.com>
Cc: Gautam Dawar <gautam.dawar@xilinx.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Eli Cohen <elic@nvidia.com>
Cc: Parav Pandit <parav@nvidia.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: alsa-devel@alsa-project.org
Cc: dri-devel@lists.freedesktop.org
Cc: kvm@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: linux-block@vger.kernel.org
Cc: linux-input@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Link: https://lore.kernel.org/r/20221123122523.1332370-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The dev_uevent() in struct class should not be modifying the device that
is passed into it, so mark it as a const * and propagate the function
signature changes out into all relevant subsystems that use this
callback.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russ Weight <russell.h.weight@intel.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Johan Hovold <johan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Raed Salem <raeds@nvidia.com>
Cc: Chen Zhongjin <chenzhongjin@huawei.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Avihai Horon <avihaih@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jakob Koschel <jakobkoschel@gmail.com>
Cc: Antoine Tenart <atenart@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Wang Yufen <wangyufen@huawei.com>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-media@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: linux-pm@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-usb@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Sebastian Reichel <sre@kernel.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221123122523.1332370-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmN6wAgeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG0EYH/3/RO90NbrFItraN
Lzr+d3VdbGjTu8xd1M+PRTmwh3zxLpB+Jwqr0T0A2gzL9B/D+AUPUJdrCVbv9DqS
FLJAVqoeV20dNBAHSffOOLPsgCZ+Eu+LzlNN7Iqde0e8cyZICFMNktitui84Xm/i
1NgFVgz9OZ6+aieYvUj3FrFq0p8GTIaC/oybDZrxYKcO8ZzKVMJ11swRw10wwq0g
qOOECvV3w7wlQ8upQZkzFxItKFc7EexZI6R4elXeGSJJ9Hlc092dv/zsKB9dwV+k
WcwkJrZRoezYXzgGBFxUcQtzi+ethjrPjuJuM1rYLUSIcfIW/0lkaSLgRoBu8D+I
1GfXkXs=
=gt6P
-----END PGP SIGNATURE-----
Backmerge tag 'v6.1-rc6' into drm-next
Linux 6.1-rc6
This is needed for drm-misc-next and tegra.
Signed-off-by: Dave Airlie <airlied@redhat.com>
These cases were done with this Coccinelle:
@@
expression H;
expression L;
@@
- (get_random_u32_below(H) + L)
+ get_random_u32_inclusive(L, H + L - 1)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- + E
- - E
)
@@
expression H;
expression L;
expression E;
@@
get_random_u32_inclusive(L,
H
- - E
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- - E
+ F
- + E
)
@@
expression H;
expression L;
expression E;
expression F;
@@
get_random_u32_inclusive(L,
H
- + E
+ F
- - E
)
And then subsequently cleaned up by hand, with several automatic cases
rejected if it didn't make sense contextually.
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This is a simple mechanical transformation done by:
@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
When filling a cm_id entry, return "-EAGAIN" instead of 0 if the cm_id
doesn'the have the same port as requested, otherwise an incomplete entry
may be returned, which causes "rdam res show cm_id" to return an error.
For example on a machine with two rdma devices with "rping -C 1 -v -s"
running background, the "rdma" command fails:
$ rdma -V
rdma utility, iproute2-5.19.0
$ rdma res show cm_id
link mlx5_0/- cm-idn 0 state LISTEN ps TCP pid 28056 comm rping src-addr 0.0.0.0:7174
error: Protocol not available
While with this fix it succeeds:
$ rdma res show cm_id
link mlx5_0/- cm-idn 0 state LISTEN ps TCP pid 26395 comm rping src-addr 0.0.0.0:7174
link mlx5_1/- cm-idn 0 state LISTEN ps TCP pid 26395 comm rping src-addr 0.0.0.0:7174
Fixes: 00313983cd ("RDMA/nldev: provide detailed CM_ID information")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Link: https://lore.kernel.org/r/a08e898cdac5e28428eb749a99d9d981571b8ea7.1667810736.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The MR restrack also needs to be released when delete it, otherwise it
cause memory leak as the task struct won't be released.
Fixes: 13ef5539de ("RDMA/restrack: Count references to the verbs objects")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://lore.kernel.org/r/703db18e8d4ef628691fb93980a709be673e62e3.1667810736.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The callbacks in struct class namespace() and get_ownership() do not
modify the struct device passed to them, so mark the pointer as constant
and fix up all callbacks in the kernel to have the correct function
signature.
This helps make it more obvious what calls and callbacks do, and do not,
modify structures passed to them.
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Link: https://lore.kernel.org/r/20221001165426.2690912-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
KASAN reported a null-ptr-deref error:
KASAN: null-ptr-deref in range [0x0000000000000118-0x000000000000011f]
CPU: 1 PID: 379
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:destroy_workqueue+0x2f/0x740
RSP: 0018:ffff888016137df8 EFLAGS: 00000202
...
Call Trace:
ib_core_cleanup+0xa/0xa1 [ib_core]
__do_sys_delete_module.constprop.0+0x34f/0x5b0
do_syscall_64+0x3a/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fa1a0d221b7
...
It is because the fail of roce_gid_mgmt_init() is ignored:
ib_core_init()
roce_gid_mgmt_init()
gid_cache_wq = alloc_ordered_workqueue # fail
...
ib_core_cleanup()
roce_gid_mgmt_cleanup()
destroy_workqueue(gid_cache_wq)
# destroy an unallocated wq
Fix this by catching the fail of roce_gid_mgmt_init() in ib_core_init().
Fixes: 03db3a2d81 ("IB/core: Add RoCE GID table management")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Link: https://lore.kernel.org/r/20221025024146.109137-1-chenzhongjin@huawei.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Commit 27cfde795a ("RDMA/cma: Fix arguments order in net device
validation") swapped the src and dst addresses in the call to
validate_net_dev().
As a consequence, the test in validate_ipv4_net_dev() to see if the
net_dev is the right one, is incorrect for port 1 <-> 2 communication when
the ports are on the same sub-net. This is fixed by denoting the
flowi4_oif as the device instead of the incoming one.
The bug has not been observed using IPv6 addresses.
Fixes: 27cfde795a ("RDMA/cma: Fix arguments order in net device validation")
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Link: https://lore.kernel.org/r/20221012141542.16925-1-haakon.bugge@oracle.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Prepare InfiniBand drivers to the common dynamic dma-buf locking
convention by starting to use the unlocked versions of dma-buf API
functions.
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221017172229.42269-11-dmitry.osipenko@collabora.com
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:
@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)
@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@
- RAND = get_random_u32();
... when != RAND
- RAND %= (E);
+ RAND = prandom_u32_max(E);
// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@
((T)get_random_u32()@p & (LITERAL))
// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@
value = None
if literal.startswith('0x'):
value = int(literal, 16)
elif literal[0] in '123456789':
value = int(literal, 10)
if value is None:
print("I don't know how to handle %s" % (literal))
cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
print("Skipping 0x%x for cleanup elsewhere" % (value))
cocci.include_match(False)
elif value & (value + 1) != 0:
print("Skipping 0x%x because it's not a power of two minus one" % (value))
cocci.include_match(False)
elif literal.startswith('0x'):
coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))
// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@
- (FUNC()@p & (LITERAL))
+ prandom_u32_max(RESULT)
@collapse_ret@
type T;
identifier VAR;
expression E;
@@
{
- T VAR;
- VAR = (E);
- return VAR;
+ return E;
}
@drop_var@
type T;
identifier VAR;
@@
{
- T VAR;
... when != VAR
}
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This uses the same passing protocol as UVERBS_ATTR_FD (eg len = 0 data_s64
= fd), except that the FD is not required to be a uverbs object and the
core code does not covert the FD to an object handle automatically.
Access to the int fd is provided by uverbs_get_raw_fd().
Link: https://lore.kernel.org/r/2-v1-bd147097458e+ede-umem_dmabuf_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
"&srq->pd->usecnt" and "&pd->usecnt" are different names for the same
reference count. Use "&pd->usecnt" consistently for both the increment
and decrement.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YyxFe3Pm0uzRuBkQ@kili
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Set 'iova' and 'length' on ib_mr in ib_uverbs and ib_core layers to let all
drivers have the members filled. Also, this commit removes redundancy in
the respective drivers.
Previously, commit 04c0a5fcfc ("IB/uverbs: Set IOVA on IB MR in uverbs
layer") changed to set 'iova', but seems to have missed 'length' and the
ib_core layer at that time.
Fixes: 04c0a5fcfc ("IB/uverbs: Set IOVA on IB MR in uverbs layer")
Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
Link: https://lore.kernel.org/r/20220921080844.1616883-1-matsuda-daisuke@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In inter-subnet cases, when inbound/outbound PRs are available,
outbound_PR.dlid is used as the requestor's datapath DLID and
inbound_PR.dlid is used as the responder's DLID. The inbound_PR.dlid
is passed to responder side with the "ConnectReq.Primary_Local_Port_LID"
field. With this solution the PERMISSIVE_LID is no longer used in
Primary Local LID field.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/b3f6cac685bce9dde37c610be82e2c19d9e51d9e.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The responder should always use WC's SLID as the dlid, to follow the
IB SPEC section "13.5.4.2 COMMON RESPONSE ACTIONS":
A responder always takes the following actions in constructing a
response packet:
- The SLID of the received packet is used as the DLID in the response
packet.
Fixes: ac3a949fb2 ("IB/CM: Set appropriate slid and dlid when handling CM request")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/cd17c240231e059d2fc07c17dfe555d548b917eb.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Support receiving inbound and outbound IB path records (along with GMP
PathRecord) from user-space service through the RDMA netlink channel.
The LIDs in these 3 PRs can be used in this way:
1. GMP PR: used as the standard local/remote LIDs;
2. DLID of outbound PR: Used as the "dlid" field for outbound traffic;
3. DLID of inbound PR: Used as the "dlid" field for outbound traffic in
responder side.
This is aimed to support adaptive routing. With current IB routing
solution when a packet goes out it's assigned with a fixed DLID per
target, meaning a fixed router will be used.
The LIDs in inbound/outbound path records can be used to identify group
of routers that allow communication with another subnet's entity. With
them packets from an inter-subnet connection may travel through any
router in the set to reach the target.
As confirmed with Jason, when sending a netlink request, kernel uses
LS_RESOLVE_PATH_USE_ALL so that the service knows kernel supports
multiple PRs.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/2fa2b6c93c4c16c8915bac3cfc4f27be1d60519d.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This fields means the total number of primary and alternative paths,
i.e.,:
0 - No primary nor alternate path is available;
1 - Only primary path is available;
2 - Both primary and alternate path are available.
Rename it to avoid confusion as with follow patches primary path will
support multiple path records.
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/cbe424de63a56207870d70c5edce7c68e45f429e.1662631201.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix a nested dead lock as part of ODP flow by using mmput_async().
From the below call trace [1] can see that calling mmput() once we have
the umem_odp->umem_mutex locked as required by
ib_umem_odp_map_dma_and_lock() might trigger in the same task the
exit_mmap()->__mmu_notifier_release()->mlx5_ib_invalidate_range() which
may dead lock when trying to lock the same mutex.
Moving to use mmput_async() will solve the problem as the above
exit_mmap() flow will be called in other task and will be executed once
the lock will be available.
[1]
[64843.077665] task:kworker/u133:2 state:D stack: 0 pid:80906 ppid:
2 flags:0x00004000
[64843.077672] Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib]
[64843.077719] Call Trace:
[64843.077722] <TASK>
[64843.077724] __schedule+0x23d/0x590
[64843.077729] schedule+0x4e/0xb0
[64843.077735] schedule_preempt_disabled+0xe/0x10
[64843.077740] __mutex_lock.constprop.0+0x263/0x490
[64843.077747] __mutex_lock_slowpath+0x13/0x20
[64843.077752] mutex_lock+0x34/0x40
[64843.077758] mlx5_ib_invalidate_range+0x48/0x270 [mlx5_ib]
[64843.077808] __mmu_notifier_release+0x1a4/0x200
[64843.077816] exit_mmap+0x1bc/0x200
[64843.077822] ? walk_page_range+0x9c/0x120
[64843.077828] ? __cond_resched+0x1a/0x50
[64843.077833] ? mutex_lock+0x13/0x40
[64843.077839] ? uprobe_clear_state+0xac/0x120
[64843.077860] mmput+0x5f/0x140
[64843.077867] ib_umem_odp_map_dma_and_lock+0x21b/0x580 [ib_core]
[64843.077931] pagefault_real_mr+0x9a/0x140 [mlx5_ib]
[64843.077962] pagefault_mr+0xb4/0x550 [mlx5_ib]
[64843.077992] pagefault_single_data_segment.constprop.0+0x2ac/0x560
[mlx5_ib]
[64843.078022] mlx5_ib_eqe_pf_action+0x528/0x780 [mlx5_ib]
[64843.078051] process_one_work+0x22b/0x3d0
[64843.078059] worker_thread+0x53/0x410
[64843.078065] ? process_one_work+0x3d0/0x3d0
[64843.078073] kthread+0x12a/0x150
[64843.078079] ? set_kthread_struct+0x50/0x50
[64843.078085] ret_from_fork+0x22/0x30
[64843.078093] </TASK>
Fixes: 36f30e486d ("IB/core: Improve ODP to use hmm_range_fault()")
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/74d93541ea533ef7daec6f126deb1072500aeb16.1661251841.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Move the device and service_id match code at the top of
cm_insert_listen() and cm_find_listen() into the final else branch.
Link: https://lore.kernel.org/r/20220819090859.957943-4-markzhang@nvidia.com
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The service_mask is always ~cpu_to_be64(0), so the result is always
a NOP when it is &'d with a service_id. Remove it for simplicity.
Link: https://lore.kernel.org/r/20220819090859.957943-3-markzhang@nvidia.com
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix the order of source and destination addresses when resolving the
route between server and client to validate use of correct net device.
The reverse order we had so far didn't actually validate the net device
as the server would try to resolve the route to itself, thus always
getting the server's net device.
The issue was discovered when running cm applications on a single host
between 2 interfaces with same subnet and source based routing rules.
When resolving the reverse route the source based route rules were
ignored.
Fixes: f887f2ac87 ("IB/cma: Validate routing of incoming requests")
Link: https://lore.kernel.org/r/1c1ec2277a131d277ebcceec987fd338d35b775f.1661251872.git.leonro@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
ib_umem_dmabuf_map_pages() returns 0 on success and -ERRNO on failure.
dma_resv_wait_timeout() uses a different scheme:
* Returns -ERESTARTSYS if interrupted, 0 if the wait timed out, or
* greater than zero on success.
This results in ib_umem_dmabuf_map_pages() being non-functional as a
positive return will be understood to be an error by drivers.
Fixes: f30bceab16 ("RDMA: use dma_resv_wait() instead of extracting the fence")
Cc: stable@kernel.org
Link: https://lore.kernel.org/r/0-v1-d8f4e1fa84c8+17-rdma_dmabuf_fix_jgg@nvidia.com
Tested-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
- convert arm32 to the common dma-direct code (Arnd Bergmann, Robin Murphy,
Christoph Hellwig)
- restructure the PCIe peer to peer mapping support (Logan Gunthorpe)
- allow the IOMMU code to communicate an optional DMA mapping length
and use that in scsi and libata (John Garry)
- split the global swiotlb lock (Tianyu Lan)
- various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
Lukas Bulwahn, Robin Murphy)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmLuIYULHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPS5A//Ty1ZNyXExmwZ6J6g7/oIvQlpAHilDr22mCd8tR8Y
Ne7TgLa/X+usFvJTxJfkvg/LNMDjD7qx0J/mhDGm4reOFcEL4/PBy0rDSOgnmntV
k/fPhgwnpuztiAQ+s+WkJ3pkrmG1HaEId7GGj2JaoYdas6RX2mGX7vL8uvUFepjw
lYPAqWMtJHkOfsDK0PqqyQsr7dcC6lyFLqnn/wqvHtTJeKCfGs6W/SIrlWme2SZY
3dNx84ZR1uPjaazAmtf2IWfjh/TBmd0ETRYycgUUKRP9iwsCkBQDBwsBGSIYXiWj
BUKQ5oMvjAlUGRF0jYz9e77KuedE6GxWiXNQstitBmid142M37DHA5tvZRf65MPS
THHcjTDmmoaO4YfFhhXOcFOrjG4/V8bF7fgHB6XkHDjhVVTcnIx8zuOAXIVBZvIV
VAALmamBqEfIZZrCqgr7hzFssK2bip+TIMkdoD46Wcr+D7bAlujhuzWxubn9+ulT
23v/pAvC80ut6LvKj6EA+GpRm/pejfOtEbjXPoO2hguNxvuUKvPQqNh9hy0q+v1e
8n2Y/4lhy5bv02S7wKooNkfCoV753jBY1TIru45UmEYc3EkTQPii6okYe0DvW4QX
VCnKgo156wSBfE+9eWdxCROv2SZqJFMV/wL3vw54dpJQMbDy7VkNsh4mGREdUkU1
uek=
=Bv19
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- convert arm32 to the common dma-direct code (Arnd Bergmann, Robin
Murphy, Christoph Hellwig)
- restructure the PCIe peer to peer mapping support (Logan Gunthorpe)
- allow the IOMMU code to communicate an optional DMA mapping length
and use that in scsi and libata (John Garry)
- split the global swiotlb lock (Tianyu Lan)
- various fixes and cleanup (Chao Gao, Dan Carpenter, Dongli Zhang,
Lukas Bulwahn, Robin Murphy)
* tag 'dma-mapping-5.20-2022-08-06' of git://git.infradead.org/users/hch/dma-mapping: (45 commits)
swiotlb: fix passing local variable to debugfs_create_ulong()
dma-mapping: reformat comment to suppress htmldoc warning
PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
RDMA/rw: drop pci_p2pdma_[un]map_sg()
RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
nvme-pci: convert to using dma_map_sgtable()
nvme-pci: check DMA ops when indicating support for PCI P2PDMA
iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
iommu: Explicitly skip bus address marked segments in __iommu_map_sg()
dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
dma-direct: support PCI P2PDMA pages in dma-direct map_sg
dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
PCI/P2PDMA: Attempt to set map_type if it has not been set
lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
swiotlb: clean up some coding style and minor issues
dma-mapping: update comment after dmabounce removal
scsi: sd: Add a comment about limiting max_sectors to shost optimal limit
ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors
scsi: scsi_transport_sas: cap shost opt_sectors according to DMA optimal limit
...
This PR includes a new RDMA driver for Alibaba Cloud hardware
- Bug fixes and small features for irdma, hns, siw, qedr, hfi1, mlx5
- General spelling/grammer fixes
- rdma cm can follow changes in neighbours for control packets
- Significant amounts of rxe fixes and spec compliance changes
- Use the modern NAPI API
- Use the bitmap API instead of open coding
- Performance improvements for rtrs
- Add the ERDMA driver for Alibaba cloud
- Fix a use after free bug in SRP
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYuwAuAAKCRCFwuHvBreF
YcRDAQC41YJNs7xve7r62/E6M+o/AXiwXa+m8rGRvcP3mdilNAEAhdom6HskenMZ
/sopeBWF78M9plLvNzWkwukaqIwrXgM=
=abuq
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This cycle we got a new RDMA driver "ERDMA" for the Alibaba cloud
environment. Otherwise the changes are dominated by rxe fixes.
There is another RDMA driver on the list that might get merged next
cycle, 'MANA' for the Azure cloud environment.
Summary:
- Bug fixes and small features for irdma, hns, siw, qedr, hfi1, mlx5
- General spelling/grammer fixes
- rdma cm can follow changes in neighbours for control packets
- Significant amounts of rxe fixes and spec compliance changes
- Use the modern NAPI API
- Use the bitmap API instead of open coding
- Performance improvements for rtrs
- Add the ERDMA driver for Alibaba cloud
- Fix a use after free bug in SRP"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (99 commits)
RDMA/ib_srpt: Unify checking rdma_cm_id condition in srpt_cm_req_recv()
RDMA/rxe: Fix error unwind in rxe_create_qp()
RDMA/mlx5: Add missing check for return value in get namespace flow
RDMA/rxe: Split qp state for requester and completer
RDMA/rxe: Generate error completion for error requester QP state
RDMA/rxe: Update wqe_index for each wqe error completion
RDMA/srpt: Fix a use-after-free
RDMA/srpt: Introduce a reference count in struct srpt_device
RDMA/srpt: Duplicate port name members
IB/qib: Fix repeated "in" within comments
RDMA/erdma: Add driver to kernel build environment
RDMA/erdma: Add the ABI definitions
RDMA/erdma: Add the erdma module
RDMA/erdma: Add connection management (CM) support
RDMA/erdma: Add verbs implementation
RDMA/erdma: Add verbs header file
RDMA/erdma: Add event queue implementation
RDMA/erdma: Add cmdq implementation
RDMA/erdma: Add main include file
RDMA/erdma: Add the hardware related definitions
...
dma_map_sg() now supports the use of P2PDMA pages so pci_p2pdma_map_sg()
is no longer necessary and may be dropped. This means the
rdma_rw_[un]map_sg() helpers are no longer necessary. Remove it all.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
cm_alloc_id_priv() allocates resource for the cm_id_priv. When
cm_init_listen() fails it doesn't free it, leading to memory leak.
Add the missing error unwind.
Fixes: 98f67156a8 ("RDMA/cm: Simplify establishing a listen cm_id")
Link: https://lore.kernel.org/r/20220621052546.4821-1-linmq006@gmail.com
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a netevent callback for cma, mainly to catch NETEVENT_NEIGH_UPDATE.
Previously, when a system with failover MAC mechanism change its MAC address
during a CM connection attempt, the RDMA-CM would take a lot of time till
it disconnects and timesout due to the incorrect MAC address.
Now when we get a NETEVENT_NEIGH_UPDATE we check if it is due to a failover
MAC change and if so, we instantly destroy the CM and notify the user in order
to spare the unnecessary waiting for the timeout.
Link: https://lore.kernel.org/r/bb255c9e301cd50b905663b8e73f7f5133d0e4c5.1654601342.git.leonro@nvidia.com
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add to the cma, a tree that keeps track of all rdma_id_private channels
that were created while in RoCE mode.
The IDs are sorted first according to their netdevice ifindex then their
destination IP. And for IDs with matching IP they would be at the same node
in the tree, since the tree data is a list of all ids with matching destination IP.
The tree allows fast and efficient lookup of ids using an ifindex and
IP address which is useful for identifying relevant net_events promptly.
Link: https://lore.kernel.org/r/2fac52c86cc918c634ab24b3867d4aed992f54ec.1654601342.git.leonro@nvidia.com
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Small collection of incremental improvement patches:
- Minor code cleanup patches, comment improvements, etc from static tools
- Clean the some of the kernel caps, reducing the historical stealth uAPI
leftovers
- Bug fixes and minor changes for rdmavt, hns, rxe, irdma
- Remove unimplemented cruft from rxe
- Reorganize UMR QP code in mlx5 to avoid going through the IB verbs layer
- flush_workqueue(system_unbound_wq) removal
- Ensure rxe waits for objects to be unused before allowing the core to
free them
- Several rc quality bug fixes for hfi1
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCYo+NxgAKCRCFwuHvBreF
YbqSAQDJ+QolaATUvOQUPLbuLopUCJLe95VS15Kl3SNXiVUUFAEA8DLL1s6+WShd
AgypUxGHipx5BAytrn45/WiwuDeEbQ8=
=jgTl
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Small collection of incremental improvement patches:
- Minor code cleanup patches, comment improvements, etc from static
tools
- Clean the some of the kernel caps, reducing the historical stealth
uAPI leftovers
- Bug fixes and minor changes for rdmavt, hns, rxe, irdma
- Remove unimplemented cruft from rxe
- Reorganize UMR QP code in mlx5 to avoid going through the IB verbs
layer
- flush_workqueue(system_unbound_wq) removal
- Ensure rxe waits for objects to be unused before allowing the core
to free them
- Several rc quality bug fixes for hfi1"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (67 commits)
RDMA/rtrs-clt: Fix one kernel-doc comment
RDMA/hfi1: Remove all traces of diagpkt support
RDMA/hfi1: Consolidate software versions
RDMA/hfi1: Remove pointless driver version
RDMA/hfi1: Fix potential integer multiplication overflow errors
RDMA/hfi1: Prevent panic when SDMA is disabled
RDMA/hfi1: Prevent use of lock before it is initialized
RDMA/rxe: Fix an error handling path in rxe_get_mcg()
IB/core: Fix typo in comment
RDMA/core: Fix typo in comment
IB/hf1: Fix typo in comment
IB/qib: Fix typo in comment
IB/iser: Fix typo in comment
RDMA/mlx4: Avoid flush_scheduled_work() usage
IB/isert: Avoid flush_scheduled_work() usage
RDMA/mlx5: Remove duplicate pointer assignment in mlx5_ib_alloc_implicit_mr()
RDMA/qedr: Remove unnecessary synchronize_irq() before free_irq()
RDMA/hns: Use hr_reg_read() instead of remaining roce_get_xxx()
RDMA/hns: Use hr_reg_xxx() instead of remaining roce_set_xxx()
RDMA/irdma: Add SW mechanism to generate completions on error
...
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmKKlIAeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGC3oH/iPm/fLG2sJut8My
sU0RC9K+6ESV5h2Qy6k00/lqKstlu4EvBjw4V8vYpx3Q2+hbSFMn2SeWqqqT3Lkk
Zb8KINCFuuyMtdCBb42PV0zhUf5pCQF7ocm/Ae4jllDHtPmqk3WJ6IGtZBK5JBlw
z6RR/wKt0y0MRj9eZyPyYjOee2L2vuVh4tgnexK/4L8g2ZtMMRThhvUzSMWG4zxR
STYYNp0uFcfT1Vt85+ODevFH4TvdECAj+SqAegN+seHLM17YY7M0/WiIYpxGRv8P
lIpDQl4PBU8EBkpI5hkpJ/3qPincbuVOMLsYfxFtpcjjG12vGjFp2krGpS3TedZQ
3mvaJ7c=
=vLke
-----END PGP SIGNATURE-----
Merge tag 'v5.18' into rdma.git for-next
Following patches have dependencies.
Resolve the merge conflict in
drivers/net/ethernet/mellanox/mlx5/core/main.c by keeping the new names
for the fs functions following linux-next:
https://lore.kernel.org/r/20220519113529.226bc3e2@canb.auug.org.au/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
drm/drm-next has a build fix for the NewVision NV3052C panel
(drivers/gpu/drm/panel/panel-newvision-nv3052c.c), which needs to be
merged back to drm-misc-next, as it was failing to build there.
Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Leon Romanovsky says:
====================
Mellanox shared branch that includes:
* Removal of FPGA TLS code https://lore.kernel.org/all/cover.1649073691.git.leonro@nvidia.com
Mellanox INNOVA TLS cards are EOL in May, 2018 [1]. As such, the code
is unmaintained, untested and not in-use by any upstream/distro oriented
customers. In order to reduce code complexity, drop the kernel code,
clean build config options and delete useless kTLS vs. TLS separation.
[1] https://network.nvidia.com/related-docs/eol/LCR-000286.pdf
* Removal of FPGA IPsec code https://lore.kernel.org/all/cover.1649232994.git.leonro@nvidia.com
Together with FPGA TLS, the IPsec went to EOL state in the November of
2019 [1]. Exactly like FPGA TLS, no active customers exist for this
upstream code and all the complexity around that area can be deleted.
[2] https://network.nvidia.com/related-docs/eol/LCR-000535.pdf
* Fix to undefined behavior from Borislav https://lore.kernel.org/all/20220405151517.29753-11-bp@alien8.de
====================
* 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux:
net/mlx5: Remove not-implemented IPsec capabilities
net/mlx5: Remove ipsec_ops function table
net/mlx5: Reduce kconfig complexity while building crypto support
net/mlx5: Move IPsec file to relevant directory
net/mlx5: Remove not-needed IPsec config
net/mlx5: Align flow steering allocation namespace to common style
net/mlx5: Unify device IPsec capabilities check
net/mlx5: Remove useless IPsec device checks
net/mlx5: Remove ipsec vs. ipsec offload file separation
RDMA/core: Delete IPsec flow action logic from the core
RDMA/mlx5: Drop crypto flow steering API
RDMA/mlx5: Delete never supported IPsec flow action
net/mlx5: Remove FPGA ipsec specific statistics
net/mlx5: Remove XFRM no_trailer flag
net/mlx5: Remove not-used IDA field from IPsec struct
net/mlx5: Delete metadata handling logic
net/mlx5_fpga: Drop INNOVA IPsec support
IB/mlx5: Fix undefined behavior due to shift overflowing the constant
net/mlx5: Cleanup kTLS function names and their exposure
net/mlx5: Remove tls vs. ktls separation as it is the same
net/mlx5: Remove indirection in TLS build
net/mlx5: Reliably return TLS device capabilities
net/mlx5_fpga: Drop INNOVA TLS support
Link: https://lore.kernel.org/r/20220409055303.1223644-1-leon@kernel.org
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
UAPI Changes:
Cross-subsystem Changes:
Core Changes:
- atomic: Add atomic_print_state to private objects
- edid: Constify the EDID parsing API, rework of the API
- dma-buf: Add dma_resv_replace_fences, dma_resv_get_singleton, make
dma_resv_excl_fence private
- format: Support monochrome formats
- fbdev: fixes for cfb_imageblit and sys_imageblit, pagelist
corruption fix
- selftests: several small fixes
- ttm: Rework bulk move handling
Driver Changes:
- Switch all relevant drivers to drm_mode_copy or drm_mode_duplicate
- bridge: conversions to devm_drm_of_get_bridge and panel_bridge,
autosuspend for analogix_dp, audio support for it66121, DSI to DPI
support for tc358767, PLL fixes and I2C support for icn6211
- bridge_connector: Enable HPD if supported
- etnaviv: fencing improvements
- gma500: GEM and GTT improvements, connector handling fixes
- komeda: switch to plane reset helper
- mediatek: MIPI DSI improvements
- omapdrm: GEM improvements
- panel: DT bindings fixes for st7735r, few fixes for ssd130x, new
panels: ltk035c5444t, B133UAN01, NV3052C
- qxl: Allow to run on arm64
- sysfb: Kconfig rework, support for VESA graphic mode selection
- vc4: Add a tracepoint for CL submissions, HDMI YUV output,
HDMI and clock improvements
- virtio: Remove restriction of non-zero blob_flags,
- vmwgfx: support for CursorMob and CursorBypass 4, various
improvements and small fixes
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCYk6nlQAKCRDj7w1vZxhR
xaTTAP0ZeeXRWIYxFfmuEAUd3H4ztvr3cx/QU/85qMXQUM4gSgD/cvQHMeucrFlX
2Bafjzl/p1tQrth0HNOkSz85dABUUws=
=rJSD
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-next-2022-04-07' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for 5.19:
UAPI Changes:
Cross-subsystem Changes:
Core Changes:
- atomic: Add atomic_print_state to private objects
- edid: Constify the EDID parsing API, rework of the API
- dma-buf: Add dma_resv_replace_fences, dma_resv_get_singleton, make
dma_resv_excl_fence private
- format: Support monochrome formats
- fbdev: fixes for cfb_imageblit and sys_imageblit, pagelist
corruption fix
- selftests: several small fixes
- ttm: Rework bulk move handling
Driver Changes:
- Switch all relevant drivers to drm_mode_copy or drm_mode_duplicate
- bridge: conversions to devm_drm_of_get_bridge and panel_bridge,
autosuspend for analogix_dp, audio support for it66121, DSI to DPI
support for tc358767, PLL fixes and I2C support for icn6211
- bridge_connector: Enable HPD if supported
- etnaviv: fencing improvements
- gma500: GEM and GTT improvements, connector handling fixes
- komeda: switch to plane reset helper
- mediatek: MIPI DSI improvements
- omapdrm: GEM improvements
- panel: DT bindings fixes for st7735r, few fixes for ssd130x, new
panels: ltk035c5444t, B133UAN01, NV3052C
- qxl: Allow to run on arm64
- sysfb: Kconfig rework, support for VESA graphic mode selection
- vc4: Add a tracepoint for CL submissions, HDMI YUV output,
HDMI and clock improvements
- virtio: Remove restriction of non-zero blob_flags,
- vmwgfx: support for CursorMob and CursorBypass 4, various
improvements and small fixes
[airlied: fixup conflict with newvision panel callbacks]
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20220407085940.pnflvjojs4qw4b77@houat
The removal of mlx5 flow steering logic, left the kernel without any RDMA
drivers that implements flow action callbacks supplied by RDMA/core. Any
user access to them caused to EOPNOTSUPP error, which can be achieved by
simply removing ioctl implementation.
Link: https://lore.kernel.org/r/a638e376314a2eb1c66f597c0bbeeab2e5de7faf.1649232994.git.leonro@nvidia.com
Reviewed-by: Raed Salem <raeds@nvidia.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
To move the list iterator variable into the list_for_each_entry_*() macro
in the future it should be avoided to use the list iterator variable after
the loop body.
To *never* use the list iterator variable after the loop it was concluded
to use a separate iterator variable instead of a found boolean.
This removes the need to use a found variable and simply checking if the
variable was set, can determine if the break/goto was hit.
Link: https://lore.kernel.org/r/20220331091634.644840-1-jakobkoschel@gmail.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This change adds the dma_resv_usage enum and allows us to specify why a
dma_resv object is queried for its containing fences.
Additional to that a dma_resv_usage_rw() helper function is added to aid
retrieving the fences for a read or write userspace submission.
This is then deployed to the different query functions of the dma_resv
object and all of their users. When the write paratermer was previously
true we now use DMA_RESV_USAGE_WRITE and DMA_RESV_USAGE_READ otherwise.
v2: add KERNEL/OTHER in separate patch
v3: some kerneldoc suggestions by Daniel
v4: some more kerneldoc suggestions by Daniel, fix missing cases lost in
the rebase pointed out by Bas.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20220407085946.744568-2-christian.koenig@amd.com
Split out flags from ib_device::device_cap_flags that are only used
internally to the kernel into kernel_cap_flags that is not part of the
uapi. This limits the device_cap_flags to being the same bitmap that will
be copied to userspace.
This cleanly splits out the uverbs flags from the kernel flags to avoid
confusion in the flags bitmap.
Add some short comments describing which each of the kernel flags is
connected to. Remove unused kernel flags.
Link: https://lore.kernel.org/r/0-v2-22c19e565eef+139a-kern_caps_jgg@nvidia.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
1) Part of enum ib_device_cap_flags are used by ibv_query_device(3)
or ibv_query_device_ex(3), so we define them in
include/uapi/rdma/ib_user_verbs.h and only expose them to userspace.
2) Reformat enum ib_device_cap_flags by removing the indent before '='.
Link: https://lore.kernel.org/r/20220331032419.313904-2-yangx.jy@fujitsu.com
Signed-off-by: Xiao Yang <yangx.jy@fujitsu.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
On the passive side when the disconnectReq event comes, if the current
state is MRA_REP_RCVD, it needs to cancel the MAD before entering the
DREQ_RCVD and TIMEWAIT states, otherwise the destroy_id may block until
this mad will reach timeout.
Fixes: a977049dac ("[PATCH] IB: Add the kernel CM implementation")
Link: https://lore.kernel.org/r/75261c00c1d82128b1d981af9ff46e994186e621.1649062436.git.leonro@nvidia.com
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Patchces for the merge window:
- Minor bug fixes in mlx5, mthca, pvrdma, rtrs, mlx4, hfi1, hns
- Minor cleanups: coding style, useless includes and documentation
- Reorganize how multicast processing works in rxe
- Replace a red/black tree with xarray in rxe which improves performance
- DSCP support and HW address handle re-use in irdma
- Simplify the mailbox command handling in hns
- Simplify iser now that FMR is eliminated
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAmI7d8gACgkQOG33FX4g
mxq9TRAAm/3rEBtDyVan4Dm3mzq8xeJOTdYtQ29QNWZcwvhStNTo4c29jQJITZ9X
Xd+m8VEPWzjkkZ4+hDFM9HhWjOzqDcCCFcbF88WDO0+P1M3HYYUE6AbIcSilAaT2
GIby3+kAnsTOAryrSzHLehYJKUYJuw5NRGsVAPY3r6hz3UECjNb/X1KTWhXaIfNy
2OTqFx5wMQ6hZO8e4a2Wz4bBYAYm95UfK+TgfRqUwhDLCDnkDELvHMPh9pXxJecG
xzVg7W7Nv+6GWpUrqtM6W6YpriyjUOfbrtCJmBV3T/EdpiD6lZhKpkX23aLgu2Bi
XoMM67aZ0i1Ft2trCIF8GbLaejG6epBkeKkUMMSozKqFP5DjhNbD2f3WAlMI15iW
CcdyPS5re1ymPp66UlycTDSkN19wD8LfgQPLJCNM3EV8g8qs9LaKzEPWunBoZmw1
a+QX2zK8J07S21F8iq2sf/Qe1EDdmgOEOf7pmB/A5/PKqgXZPG7lYNYuhIKyFsdL
mO+BqWWp0a953JJjsiutxyUCyUd8jtwwxRa0B66oqSQ1jO+xj6XDxjEL8ACuT8Iq
9wFJluTCN6lfxyV8mrSfweT0fQh/W/zVF7EJZHWdKXEzuADxMp9X06wtMQ82ea+k
/wgN7DUpBZczRtlqaxt1VSkER75QwAMy/oSpA974+O120QCwLMQ=
=aux1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
- Minor bug fixes in mlx5, mthca, pvrdma, rtrs, mlx4, hfi1, hns
- Minor cleanups: coding style, useless includes and documentation
- Reorganize how multicast processing works in rxe
- Replace a red/black tree with xarray in rxe which improves performance
- DSCP support and HW address handle re-use in irdma
- Simplify the mailbox command handling in hns
- Simplify iser now that FMR is eliminated
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (93 commits)
RDMA/nldev: Prevent underflow in nldev_stat_set_counter_dynamic_doit()
IB/iser: Fix error flow in case of registration failure
IB/iser: Generalize map/unmap dma tasks
IB/iser: Use iser_fr_desc as registration context
IB/iser: Remove iser_reg_data_sg helper function
RDMA/rxe: Use standard names for ref counting
RDMA/rxe: Replace red-black trees by xarrays
RDMA/rxe: Shorten pool names in rxe_pool.c
RDMA/rxe: Move max_elem into rxe_type_info
RDMA/rxe: Replace obj by elem in declaration
RDMA/rxe: Delete _locked() APIs for pool objects
RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC
RDMA/rxe: Replace mr by rkey in responder resources
RDMA/rxe: Fix ref error in rxe_av.c
RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT
RDMA/irdma: Add support for address handle re-use
RDMA/qib: Fix typos in comments
RDMA/mlx5: Fix memory leak in error flow for subscribe event routine
Revert "RDMA/core: Fix ib_qp_usecnt_dec() called when error"
RDMA/rxe: Remove useless argument for update_state()
...
- Rewrite how munlock works to massively reduce the contention
on i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4ucgACgkQDpNsjXcp
gj69Wgf6AwqwmO5Tmy+fLScDPqWxmXJofbocae1kyoGHf7Ui91OK4U2j6IpvAr+g
P/vLIK+JAAcTQcrSCjymuEkf4HkGZOR03QQn7maPIEe4eLrZRQDEsmHC1L9gpeJp
s/GMvDWiGE0Tnxu0EOzfVi/yT+qjIl/S8VvqtCoJv1HdzxitZ7+1RDuqImaMC5MM
Qi3uHag78vLmCltLXpIOdpgZhdZexCdL2Y/1npf+b6FVkAJRRNUnA0gRbS7YpoVp
CbxEJcmAl9cpJLuj5i5kIfS9trr+/QcvbUlzRxh4ggC58iqnmF2V09l2MJ7YU3XL
v1O/Elq4lRhXninZFQEm9zjrri7LDQ==
=n9Ad
-----END PGP SIGNATURE-----
Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Rewrite how munlock works to massively reduce the contention on
i_mmap_rwsem (Hugh Dickins):
https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/
- Sort out the page refcount mess for ZONE_DEVICE pages (Christoph
Hellwig):
https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/
- Convert GUP to use folios and make pincount available for order-1
pages. (Matthew Wilcox)
- Convert a few more truncation functions to use folios (Matthew
Wilcox)
- Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew
Wilcox)
- Convert rmap_walk to use folios (Matthew Wilcox)
- Convert most of shrink_page_list() to use a folio (Matthew Wilcox)
- Add support for creating large folios in readahead (Matthew Wilcox)
* tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits)
mm/damon: minor cleanup for damon_pa_young
selftests/vm/transhuge-stress: Support file-backed PMD folios
mm/filemap: Support VM_HUGEPAGE for file mappings
mm/readahead: Switch to page_cache_ra_order
mm/readahead: Align file mappings for non-DAX
mm/readahead: Add large folio readahead
mm: Support arbitrary THP sizes
mm: Make large folios depend on THP
mm: Fix READ_ONLY_THP warning
mm/filemap: Allow large folios to be added to the page cache
mm: Turn can_split_huge_page() into can_split_folio()
mm/vmscan: Convert pageout() to take a folio
mm/vmscan: Turn page_check_references() into folio_check_references()
mm/vmscan: Account large folios correctly
mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios
mm/vmscan: Free non-shmem folios without splitting them
mm/rmap: Constify the rmap_walk_control argument
mm/rmap: Convert rmap_walk() to take a folio
mm: Turn page_anon_vma() into folio_anon_vma()
mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read()
...
This code checks "index" for an upper bound but it does not check for
negatives. Change the type to unsigned to prevent underflows.
Fixes: 3c3c1f1416 ("RDMA/nldev: Allow optional-counter status configuration through RDMA netlink")
Link: https://lore.kernel.org/r/20220316083948.GC30941@kili
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_destroy_qp() would called by ib_create_qp_user() if error, the former
contains ib_qp_usecnt_dec(), but ib_qp_usecnt_inc() was not called before.
So move ib_qp_usecnt_inc() into create_qp().
Fixes: d2b10794fc ("RDMA/core: Create clean QP creations interface for uverbs")
Link: https://lore.kernel.org/r/20220303024232.2847388-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.
Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
The rdma_zalloc_drv_obj() in __ib_alloc_pd() would zero pd, it unnecessary
add NULL to the object in struct pd.
The uverbs_free_pd() already return busy if pd->usecnt is true, there is
no need to add a warning.
Link: https://lore.kernel.org/r/20220223074901.201506-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If the state is not idle then resolve_prepare_src() should immediately
fail and no change to global state should happen. However, it
unconditionally overwrites the src_addr trying to build a temporary any
address.
For instance if the state is already RDMA_CM_LISTEN then this will corrupt
the src_addr and would cause the test in cma_cancel_operation():
if (cma_any_addr(cma_src_addr(id_priv)) && !id_priv->cma_dev)
Which would manifest as this trace from syzkaller:
BUG: KASAN: use-after-free in __list_add_valid+0x93/0xa0 lib/list_debug.c:26
Read of size 8 at addr ffff8881546491e0 by task syz-executor.1/32204
CPU: 1 PID: 32204 Comm: syz-executor.1 Not tainted 5.12.0-rc8-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
__kasan_report mm/kasan/report.c:399 [inline]
kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
__list_add_valid+0x93/0xa0 lib/list_debug.c:26
__list_add include/linux/list.h:67 [inline]
list_add_tail include/linux/list.h:100 [inline]
cma_listen_on_all drivers/infiniband/core/cma.c:2557 [inline]
rdma_listen+0x787/0xe00 drivers/infiniband/core/cma.c:3751
ucma_listen+0x16a/0x210 drivers/infiniband/core/ucma.c:1102
ucma_write+0x259/0x350 drivers/infiniband/core/ucma.c:1732
vfs_write+0x28e/0xa30 fs/read_write.c:603
ksys_write+0x1ee/0x250 fs/read_write.c:658
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
This is indicating that an rdma_id_private was destroyed without doing
cma_cancel_listens().
Instead of trying to re-use the src_addr memory to indirectly create an
any address derived from the dst build one explicitly on the stack and
bind to that as any other normal flow would do. rdma_bind_addr() will copy
it over the src_addr once it knows the state is valid.
This is similar to commit bc0bdc5afa ("RDMA/cma: Do not change
route.addr.src_addr.ss_family")
Link: https://lore.kernel.org/r/0-v2-e975c8fd9ef2+11e-syz_cma_srcaddr_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: 732d41c545 ("RDMA/cma: Make the locking for automatic state transition more clear")
Reported-by: syzbot+c94a3675a626f6333d74@syzkaller.appspotmail.com
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
XRC INI QPs should be able to adjust their local ACK timeout.
Fixes: 2c1619edef ("IB/cma: Define option to set ack timeout and pack tos_set")
Link: https://lore.kernel.org/r/1644421175-31943-1-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Suggested-by: Avneesh Pant <avneesh.pant@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In failure flow, the reference counter acquired was not released,
and the following error was reported:
drivers/infiniband/core/cm.c:3373 cm_lap_handler() warn: inconsistent
refcounting 'cm_id_priv->refcount.refs.counter':
Fixes: 7345201c39 ("IB/cm: Improve the calling of cm_init_av_for_lap and cm_init_av_by_path")
Link: https://lore.kernel.org/r/7615f23bbb5c5b66d03f6fa13e1c99d51dae6916.1642581448.git.leonro@nvidia.com
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Partially revert the commit mentioned in the Fixes line to make sure that
allocation and erasing multicast struct are locked.
BUG: KASAN: use-after-free in ucma_cleanup_multicast drivers/infiniband/core/ucma.c:491 [inline]
BUG: KASAN: use-after-free in ucma_destroy_private_ctx+0x914/0xb70 drivers/infiniband/core/ucma.c:579
Read of size 8 at addr ffff88801bb74b00 by task syz-executor.1/25529
CPU: 0 PID: 25529 Comm: syz-executor.1 Not tainted 5.16.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
__kasan_report mm/kasan/report.c:433 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
ucma_cleanup_multicast drivers/infiniband/core/ucma.c:491 [inline]
ucma_destroy_private_ctx+0x914/0xb70 drivers/infiniband/core/ucma.c:579
ucma_destroy_id+0x1e6/0x280 drivers/infiniband/core/ucma.c:614
ucma_write+0x25c/0x350 drivers/infiniband/core/ucma.c:1732
vfs_write+0x28e/0xae0 fs/read_write.c:588
ksys_write+0x1ee/0x250 fs/read_write.c:643
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Currently the xarray search can touch a concurrently freeing mc as the
xa_for_each() is not surrounded by any lock. Rather than hold the lock for
a full scan hold it only for the effected items, which is usually an empty
list.
Fixes: 95fe51096b ("RDMA/ucma: Remove mc_list and rely on xarray")
Link: https://lore.kernel.org/r/1cda5fabb1081e8d16e39a48d3a4f8160cea88b8.1642491047.git.leonro@nvidia.com
Reported-by: syzbot+e3f96c43d19782dd14a7@syzkaller.appspotmail.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In RoCE we should use cma_iboe_set_mgid() and not cma_set_mgid to generate
the mgid, otherwise we will generate an IGMP for an incorrect address.
Fixes: b5de0c60cc ("RDMA/cma: Fix use after free race in roce multicast join")
Link: https://lore.kernel.org/r/913bc6783fd7a95fe71ad9454e01653ee6fb4a9a.1642491047.git.leonro@nvidia.com
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Substantially all bug fixes and cleanups:
- Update drivers to use common helpers for GUIDs, pkeys, bitmaps,
memset_startat, and others
- General code cleanups from bots
- Simplify some of the rxe pool code in preparation for a larger rework
- Clean out old stuff from hns, including all support for hip06 devices
- Fix a bug where GID table entries could be missed if the table had holes
in it
- Rename paths and sessions in rtrs for better understandability
- Consolidate the roce source port selection code
- NDR speed support in mlx5
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAmHgct4ACgkQOG33FX4g
mxpFuQ//UqtbxowPeDB9bjJ5OLKZ1fGx0FxWkfBeR1cr0apboBNqdK1WOiz5Q7un
F2xpASNEsOCr6JMMBhHMOvNiMjRSs33GvydyBj5T7LRx/QGie+0AeSzlS314/mJs
NXvOinD21l1YEKIodw4Pfhtdl2QVmEvRpUJnccGyEGUKQ4jpUwVCTfa/tpoMVD5y
MsWqv+xOrhsmDahW2nUSXHhBIazVqYETg4EE8O7J1Lb48F98keVOdVkH5wL4nmKj
gl6oyN9lkw1sWDJBnom7mgd38L2M42mRtQkiFdMdnpj5D5jbLTcGv30GgBfyMPr6
8tI3sXcAJh3Wk3TUu2jEh2F+SjsHKRTqVjGVwQbkvEuhFK2TSHAhGC+gmP6ueZKG
diHKcJVNm6rBX6L/EczYQ7hjOiMzJLlLjhZnr8+2Lqw0X+DzQbN19RUb+XX8iqkP
ITM5LPQHf+7N8Rz2W7jcHk1h3wLv1VcKktErc6mUTHdxxpJv/XEsmLP22kqHgSyx
So6yAlMtMMMZfP6taWkpTzC6KoduFJwWARf3zYoJreeWmL18F4+Tha2th8xnQMi2
cq0UOu1WnVEFwiIzdMa3aCtTDxXQ6UgPVk1E24RaiZTEBp5hO5+Xmn56du7G89Cb
nlZbAudbh3aElbj9ptUsJGSVowGgSLJvvfgFyZz2u+wFBqdJnUk=
=EL3r
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Another small cycle. Mostly cleanups and bug fixes, quite a bit
assisted from bots. There are a few new syzkaller splats that haven't
been solved yet but they should get into the rcs in a few weeks, I
think.
Summary:
- Update drivers to use common helpers for GUIDs, pkeys, bitmaps,
memset_startat, and others
- General code cleanups from bots
- Simplify some of the rxe pool code in preparation for a larger
rework
- Clean out old stuff from hns, including all support for hip06
devices
- Fix a bug where GID table entries could be missed if the table had
holes in it
- Rename paths and sessions in rtrs for better understandability
- Consolidate the roce source port selection code
- NDR speed support in mlx5"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (83 commits)
RDMA/irdma: Remove the redundant return
RDMA/rxe: Use the standard method to produce udp source port
RDMA/irdma: Make the source udp port vary
RDMA/hns: Replace get_udp_sport with rdma_get_udp_sport
RDMA/core: Calculate UDP source port based on flow label or lqpn/rqpn
IB/qib: Fix typos
RDMA/rtrs-clt: Rename rtrs_clt to rtrs_clt_sess
RDMA/rtrs-srv: Rename rtrs_srv to rtrs_srv_sess
RDMA/rtrs-clt: Rename rtrs_clt_sess to rtrs_clt_path
RDMA/rtrs-srv: Rename rtrs_srv_sess to rtrs_srv_path
RDMA/rtrs: Rename rtrs_sess to rtrs_path
RDMA/hns: Modify the hop num of HIP09 EQ to 1
IB/iser: Align coding style across driver
IB/iser: Remove un-needed casting to/from void pointer
IB/iser: Don't suppress send completions
IB/iser: Rename ib_ret local variable
IB/iser: Fix RNR errors
IB/iser: Remove deprecated pi_guard module param
IB/mlx5: Expose NDR speed through MAD
RDMA/cxgb4: Set queue pair state when being queried
...
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmHbZ+YeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGDs4H/RgC8JOV3Dki1VtO
6OwPxUKKojhVU9LJis7kyG5voB/zE7tK5nI+jC3gYGQUFKWaZ3YY8s3UcV1zvg/b
a44b91boA+dKxEwOq4RZNQ9mU+QWnNoG5+UqBkmB8vewi3QC3T8xEmpWcERLbU7d
KrI2T6i4ksJ9OYSYMEMyrvrpt7nt3n1tDX8b71faXjf1zbLeGo9zT53t6BJ/LknV
AK406Eq/3bg36OZrKFuG7hCJfRE/cSlxF9bxK3sIfMBMQ2YPe1S5+pxl5iBD0nyl
NaHOBYcLTxPAne3YgIvK0zDdsS+EtPSlaVdWfSmNjQhX2vqEixldgdrOCmwp37vd
3gV9D28=
=hrOo
-----END PGP SIGNATURE-----
Merge tag 'v5.16' into rdma.git for-next
To resolve minor conflict in:
drivers/infiniband/hw/mlx5/mlx5_ib.h
By merging both hunks.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Core
----
- Defer freeing TCP skbs to the BH handler, whenever possible,
or at least perform the freeing outside of the socket lock section
to decrease cross-CPU allocator work and improve latency.
- Add netdevice refcount tracking to locate sources of netdevice
and net namespace refcount leaks.
- Make Tx watchdog less intrusive - avoid pausing Tx and restarting
all queues from a single CPU removing latency spikes.
- Various small optimizations throughout the stack from Eric Dumazet.
- Make netdev->dev_addr[] constant, force modifications to go via
appropriate helpers to allow us to keep addresses in ordered data
structures.
- Replace unix_table_lock with per-hash locks, improving performance
of bind() calls.
- Extend skb drop tracepoint with a drop reason.
- Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
BPF
---
- New helpers:
- bpf_find_vma(), find and inspect VMAs for profiling use cases
- bpf_loop(), runtime-bounded loop helper trading some execution
time for much faster (if at all converging) verification
- bpf_strncmp(), improve performance, avoid compiler flakiness
- bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
for tracing programs, all inlined by the verifier
- Support BPF relocations (CO-RE) in the kernel loader.
- Further the support for BTF_TYPE_TAG annotations.
- Allow access to local storage in sleepable helpers.
- Convert verifier argument types to a composable form with different
attributes which can be shared across types (ro, maybe-null).
- Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
creating new, extensible ones where missing and deprecating those
to be removed.
Protocols
---------
- WiFi (mac80211/cfg80211):
- notify user space about long "come back in N" AP responses,
allow it to react to such temporary rejections
- allow non-standard VHT MCS 10/11 rates
- use coarse time in airtime fairness code to save CPU cycles
- Bluetooth:
- rework of HCI command execution serialization to use a common
queue and work struct, and improve handling errors reported
in the middle of a batch of commands
- rework HCI event handling to use skb_pull_data, avoiding packet
parsing pitfalls
- support AOSP Bluetooth Quality Report
- SMC:
- support net namespaces, following the RDMA model
- improve connection establishment latency by pre-clearing buffers
- introduce TCP ULP for automatic redirection to SMC
- Multi-Path TCP:
- support ioctls: SIOCINQ, OUTQ, and OUTQNSD
- support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
- support cmsgs: TCP_INQ
- improvements in the data scheduler (assigning data to subflows)
- support fastclose option (quick shutdown of the full MPTCP
connection, similar to TCP RST in regular TCP)
- MCTP (Management Component Transport) over serial, as defined by
DMTF spec DSP0253 - "MCTP Serial Transport Binding".
Driver API
----------
- Support timestamping on bond interfaces in active/passive mode.
- Introduce generic phylink link mode validation for drivers which
don't have any quirks and where MAC capability bits fully express
what's supported. Allow PCS layer to participate in the validation.
Convert a number of drivers.
- Add support to set/get size of buffers on the Rx rings and size of
the tx copybreak buffer via ethtool.
- Support offloading TC actions as first-class citizens rather than
only as attributes of filters, improve sharing and device resource
utilization.
- WiFi (mac80211/cfg80211):
- support forwarding offload (ndo_fill_forward_path)
- support for background radar detection hardware
- SA Query Procedures offload on the AP side
New hardware / drivers
----------------------
- tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
real-time requirements for isochronous communication with protocols
like OPC UA Pub/Sub.
- Qualcomm BAM-DMUX WWAN - driver for data channels of modems
integrated into many older Qualcomm SoCs, e.g. MSM8916 or
MSM8974 (qcom_bam_dmux).
- Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch
driver with support for bridging, VLANs and multicast forwarding
(lan966x).
- iwlmei driver for co-operating between Intel's WiFi driver and
Intel's Active Management Technology (AMT) devices.
- mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
- Bluetooth:
- MediaTek MT7921 SDIO devices
- Foxconn MT7922A
- Realtek RTL8852AE
Drivers
-------
- Significantly improve performance in the datapaths of:
lan78xx, ax88179_178a, lantiq_xrx200, bnxt.
- Intel Ethernet NICs:
- igb: support PTP/time PEROUT and EXTTS SDP functions on
82580/i354/i350 adapters
- ixgbevf: new PF -> VF mailbox API which avoids the risk of
mailbox corruption with ESXi
- iavf: support configuration of VLAN features of finer granularity,
stacked tags and filtering
- ice: PTP support for new E822 devices with sub-ns precision
- ice: support firmware activation without reboot
- Mellanox Ethernet NICs (mlx5):
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- support TC forwarding when tunnel encap and decap happen between
two ports of the same NIC
- dynamically size and allow disabling various features to save
resources for running in embedded / SmartNIC scenarios
- Broadcom Ethernet NICs (bnxt):
- use page frag allocator to improve Rx performance
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- Other Ethernet NICs:
- amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
- Microsoft cloud/virtual NIC (mana):
- add XDP support (PASS, DROP, TX)
- Mellanox Ethernet switches (mlxsw):
- initial support for Spectrum-4 ASICs
- VxLAN with IPv6 underlay
- Marvell Ethernet switches (prestera):
- support flower flow templates
- add basic IP forwarding support
- NXP embedded Ethernet switches (ocelot & felix):
- support Per-Stream Filtering and Policing (PSFP)
- enable cut-through forwarding between ports by default
- support FDMA to improve packet Rx/Tx to CPU
- Other embedded switches:
- hellcreek: improve trapping management (STP and PTP) packets
- qca8k: support link aggregation and port mirroring
- Qualcomm 802.11ax WiFi (ath11k):
- qca6390, wcn6855: enable 802.11 power save mode in station mode
- BSS color change support
- WCN6855 hw2.1 support
- 11d scan offload support
- scan MAC address randomization support
- full monitor mode, only supported on QCN9074
- qca6390/wcn6855: report signal and tx bitrate
- qca6390: rfkill support
- qca6390/wcn6855: regdb.bin support
- Intel WiFi (iwlwifi):
- support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
in cooperation with the BIOS
- support for Optimized Connectivity Experience (OCE) scan
- support firmware API version 68
- lots of preparatory work for the upcoming Bz device family
- MediaTek WiFi (mt76):
- Specific Absorption Rate (SAR) support
- mt7921: 160 MHz channel support
- RealTek WiFi (rtw88):
- Specific Absorption Rate (SAR) support
- scan offload
- Other WiFi NICs
- ath10k: support fetching (pre-)calibration data from nvmem
- brcmfmac: configure keep-alive packet on suspend
- wcn36xx: beacon filter support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmHbkZAACgkQMUZtbf5S
IruYkQ//XX7BggcwBfukPK83j0dONolClijqKcKR08g4vB5L8GXvv6OErKIWrh4k
h8JanCH352ZkbCSw3MvFdm825UYQv8vPMd6Qks/LJ4aSKqCuy4MIlAo+yOw4Km3O
i7++lRfma6DqHHI59wvLjWoxZSPu8lL+rI8UsZ5qMOlnNlGAOXsNrzRjaqQ3FddY
AMxZeBUtrPqUCCQZFq3U8apkYzUp7CA/3XR9zRcja3uPbrtOV2G+4whRF90qGNWz
Tm/QvJ9F/Ab292cbhxR4KuaQ3hUhaCQyDjbZk3+FZzZpAVhYTVqcNjny6+yXmbiP
NXRtwemnl1NlWKMnJM8lEeY48u626tRIkxA/Wtd61uoO5uKUSxfGP+UpUi+DfXbF
yIw50VQ7L2bpxXP/HjtmhVgZDaWKYyh22Zw4Hp/muMJz0hgUB0KODY3tf2jUWbjJ
0oEgocWyzhhwMQKqupTDCIaRgIs2ewYr4ZrFDhI3HnHC/vv1VjoPRUPIyxwppD2N
cXvZb3B1sWK8iX5gCbISGzyU4bB7I0rvJSTU42ueti7n6NqRFZ79qHQpYnnY+JdO
z1qOwY/d/yWfBoXVKRtRg2qz6CdEt5BQklwAgVEBgrFpf58gp694EwGMb1htY14J
r/k9bVpmyIFpUnBH2CPMRfBVA3tUTqzyzzFV4AMw40NYLKmhLdo=
=KLm3
-----END PGP SIGNATURE-----
Merge tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core
----
- Defer freeing TCP skbs to the BH handler, whenever possible, or at
least perform the freeing outside of the socket lock section to
decrease cross-CPU allocator work and improve latency.
- Add netdevice refcount tracking to locate sources of netdevice and
net namespace refcount leaks.
- Make Tx watchdog less intrusive - avoid pausing Tx and restarting
all queues from a single CPU removing latency spikes.
- Various small optimizations throughout the stack from Eric Dumazet.
- Make netdev->dev_addr[] constant, force modifications to go via
appropriate helpers to allow us to keep addresses in ordered data
structures.
- Replace unix_table_lock with per-hash locks, improving performance
of bind() calls.
- Extend skb drop tracepoint with a drop reason.
- Allow SO_MARK and SO_PRIORITY setsockopt under CAP_NET_RAW.
BPF
---
- New helpers:
- bpf_find_vma(), find and inspect VMAs for profiling use cases
- bpf_loop(), runtime-bounded loop helper trading some execution
time for much faster (if at all converging) verification
- bpf_strncmp(), improve performance, avoid compiler flakiness
- bpf_get_func_arg(), bpf_get_func_ret(), bpf_get_func_arg_cnt()
for tracing programs, all inlined by the verifier
- Support BPF relocations (CO-RE) in the kernel loader.
- Further the support for BTF_TYPE_TAG annotations.
- Allow access to local storage in sleepable helpers.
- Convert verifier argument types to a composable form with different
attributes which can be shared across types (ro, maybe-null).
- Prepare libbpf for upcoming v1.0 release by cleaning up APIs,
creating new, extensible ones where missing and deprecating those
to be removed.
Protocols
---------
- WiFi (mac80211/cfg80211):
- notify user space about long "come back in N" AP responses,
allow it to react to such temporary rejections
- allow non-standard VHT MCS 10/11 rates
- use coarse time in airtime fairness code to save CPU cycles
- Bluetooth:
- rework of HCI command execution serialization to use a common
queue and work struct, and improve handling errors reported in
the middle of a batch of commands
- rework HCI event handling to use skb_pull_data, avoiding packet
parsing pitfalls
- support AOSP Bluetooth Quality Report
- SMC:
- support net namespaces, following the RDMA model
- improve connection establishment latency by pre-clearing buffers
- introduce TCP ULP for automatic redirection to SMC
- Multi-Path TCP:
- support ioctls: SIOCINQ, OUTQ, and OUTQNSD
- support socket options: IP_TOS, IP_FREEBIND, IP_TRANSPARENT,
IPV6_FREEBIND, and IPV6_TRANSPARENT, TCP_CORK and TCP_NODELAY
- support cmsgs: TCP_INQ
- improvements in the data scheduler (assigning data to subflows)
- support fastclose option (quick shutdown of the full MPTCP
connection, similar to TCP RST in regular TCP)
- MCTP (Management Component Transport) over serial, as defined by
DMTF spec DSP0253 - "MCTP Serial Transport Binding".
Driver API
----------
- Support timestamping on bond interfaces in active/passive mode.
- Introduce generic phylink link mode validation for drivers which
don't have any quirks and where MAC capability bits fully express
what's supported. Allow PCS layer to participate in the validation.
Convert a number of drivers.
- Add support to set/get size of buffers on the Rx rings and size of
the tx copybreak buffer via ethtool.
- Support offloading TC actions as first-class citizens rather than
only as attributes of filters, improve sharing and device resource
utilization.
- WiFi (mac80211/cfg80211):
- support forwarding offload (ndo_fill_forward_path)
- support for background radar detection hardware
- SA Query Procedures offload on the AP side
New hardware / drivers
----------------------
- tsnep - FPGA based TSN endpoint Ethernet MAC used in PLCs with
real-time requirements for isochronous communication with protocols
like OPC UA Pub/Sub.
- Qualcomm BAM-DMUX WWAN - driver for data channels of modems
integrated into many older Qualcomm SoCs, e.g. MSM8916 or MSM8974
(qcom_bam_dmux).
- Microchip LAN966x multi-port Gigabit AVB/TSN Ethernet Switch driver
with support for bridging, VLANs and multicast forwarding
(lan966x).
- iwlmei driver for co-operating between Intel's WiFi driver and
Intel's Active Management Technology (AMT) devices.
- mse102x - Vertexcom MSE102x Homeplug GreenPHY chips
- Bluetooth:
- MediaTek MT7921 SDIO devices
- Foxconn MT7922A
- Realtek RTL8852AE
Drivers
-------
- Significantly improve performance in the datapaths of: lan78xx,
ax88179_178a, lantiq_xrx200, bnxt.
- Intel Ethernet NICs:
- igb: support PTP/time PEROUT and EXTTS SDP functions on
82580/i354/i350 adapters
- ixgbevf: new PF -> VF mailbox API which avoids the risk of
mailbox corruption with ESXi
- iavf: support configuration of VLAN features of finer
granularity, stacked tags and filtering
- ice: PTP support for new E822 devices with sub-ns precision
- ice: support firmware activation without reboot
- Mellanox Ethernet NICs (mlx5):
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- support TC forwarding when tunnel encap and decap happen between
two ports of the same NIC
- dynamically size and allow disabling various features to save
resources for running in embedded / SmartNIC scenarios
- Broadcom Ethernet NICs (bnxt):
- use page frag allocator to improve Rx performance
- expose control over IRQ coalescing mode (CQE vs EQE) via ethtool
- Other Ethernet NICs:
- amd-xgbe: add Ryzen 6000 (Yellow Carp) Ethernet support
- Microsoft cloud/virtual NIC (mana):
- add XDP support (PASS, DROP, TX)
- Mellanox Ethernet switches (mlxsw):
- initial support for Spectrum-4 ASICs
- VxLAN with IPv6 underlay
- Marvell Ethernet switches (prestera):
- support flower flow templates
- add basic IP forwarding support
- NXP embedded Ethernet switches (ocelot & felix):
- support Per-Stream Filtering and Policing (PSFP)
- enable cut-through forwarding between ports by default
- support FDMA to improve packet Rx/Tx to CPU
- Other embedded switches:
- hellcreek: improve trapping management (STP and PTP) packets
- qca8k: support link aggregation and port mirroring
- Qualcomm 802.11ax WiFi (ath11k):
- qca6390, wcn6855: enable 802.11 power save mode in station mode
- BSS color change support
- WCN6855 hw2.1 support
- 11d scan offload support
- scan MAC address randomization support
- full monitor mode, only supported on QCN9074
- qca6390/wcn6855: report signal and tx bitrate
- qca6390: rfkill support
- qca6390/wcn6855: regdb.bin support
- Intel WiFi (iwlwifi):
- support SAR GEO Offset Mapping (SGOM) and Time-Aware-SAR (TAS)
in cooperation with the BIOS
- support for Optimized Connectivity Experience (OCE) scan
- support firmware API version 68
- lots of preparatory work for the upcoming Bz device family
- MediaTek WiFi (mt76):
- Specific Absorption Rate (SAR) support
- mt7921: 160 MHz channel support
- RealTek WiFi (rtw88):
- Specific Absorption Rate (SAR) support
- scan offload
- Other WiFi NICs
- ath10k: support fetching (pre-)calibration data from nvmem
- brcmfmac: configure keep-alive packet on suspend
- wcn36xx: beacon filter support"
* tag '5.17-net-next' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2048 commits)
tcp: tcp_send_challenge_ack delete useless param `skb`
net/qla3xxx: Remove useless DMA-32 fallback configuration
rocker: Remove useless DMA-32 fallback configuration
hinic: Remove useless DMA-32 fallback configuration
lan743x: Remove useless DMA-32 fallback configuration
net: enetc: Remove useless DMA-32 fallback configuration
cxgb4vf: Remove useless DMA-32 fallback configuration
cxgb4: Remove useless DMA-32 fallback configuration
cxgb3: Remove useless DMA-32 fallback configuration
bnx2x: Remove useless DMA-32 fallback configuration
et131x: Remove useless DMA-32 fallback configuration
be2net: Remove useless DMA-32 fallback configuration
vmxnet3: Remove useless DMA-32 fallback configuration
bna: Simplify DMA setting
net: alteon: Simplify DMA setting
myri10ge: Simplify DMA setting
qlcnic: Simplify DMA setting
net: allwinner: Fix print format
page_pool: remove spinlock in page_pool_refill_alloc_cache()
amt: fix wrong return type of amt_send_membership_update()
...
If dst->is_global field is not set, the GRH fields are not cleared
and the following infoleak is reported.
=====================================================
BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:121 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_user+0x1c9/0x270 lib/usercopy.c:33
instrument_copy_to_user include/linux/instrumented.h:121 [inline]
_copy_to_user+0x1c9/0x270 lib/usercopy.c:33
copy_to_user include/linux/uaccess.h:209 [inline]
ucma_init_qp_attr+0x8c7/0xb10 drivers/infiniband/core/ucma.c:1242
ucma_write+0x637/0x6c0 drivers/infiniband/core/ucma.c:1732
vfs_write+0x8ce/0x2030 fs/read_write.c:588
ksys_write+0x28b/0x510 fs/read_write.c:643
__do_sys_write fs/read_write.c:655 [inline]
__se_sys_write fs/read_write.c:652 [inline]
__ia32_sys_write+0xdb/0x120 fs/read_write.c:652
do_syscall_32_irqs_on arch/x86/entry/common.c:114 [inline]
__do_fast_syscall_32+0x96/0xf0 arch/x86/entry/common.c:180
do_fast_syscall_32+0x34/0x70 arch/x86/entry/common.c:205
do_SYSENTER_32+0x1b/0x20 arch/x86/entry/common.c:248
entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
Local variable resp created at:
ucma_init_qp_attr+0xa4/0xb10 drivers/infiniband/core/ucma.c:1214
ucma_write+0x637/0x6c0 drivers/infiniband/core/ucma.c:1732
Bytes 40-59 of 144 are uninitialized
Memory access of size 144 starts at ffff888167523b00
Data copied to user address 0000000020000100
CPU: 1 PID: 25910 Comm: syz-executor.1 Not tainted 5.16.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
=====================================================
Fixes: 4ba66093bd ("IB/core: Check for global flag when using ah_attr")
Link: https://lore.kernel.org/r/0e9dd51f93410b7b2f4f5562f52befc878b71afa.1641298868.git.leonro@nvidia.com
Reported-by: syzbot+6d532fa8f9463da290bc@syzkaller.appspotmail.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are currently 2 ways to create a set of sysfs files for a kobj_type,
through the default_attrs field, and the default_groups field. Move the
IB code to use default_groups field which has been the preferred way since
commit aa30f47cf6 ("kobject: Add support for default attribute groups to
kobj_type") so that we can soon get rid of the obsolete default_attrs
field.
Link: https://lore.kernel.org/r/20220103152259.531034-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Because of the possible failure of the allocation, data might be NULL
pointer and will cause the dereference of the NULL pointer later.
Therefore, it might be better to check it and return -ENOMEM.
Fixes: 6884c6c4bd ("RDMA/verbs: Store the write/write_ex uapi entry points in the uverbs_api")
Link: https://lore.kernel.org/r/20211231093315.1917667-1-jiasheng@iscas.ac.cn
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.
There's a lot of missing includes this was masking. Primarily
in networking tho, this time.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmG2fU0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGC7EH/3R7Rt+OD8Wn8Ss3
w8V+dBxVwa2u2oMTyUHPxaeOXZ7bi38XlUdLFPOK/76bGwO0a5TmYZqsWdRbGyT0
HfcYjHsQ0lbJXk/nh2oM47oJxJXVpThIHXJEk0FZ0Y5t+DYjIYlNHzqZymUyhLem
St74zgWcyT+MXuqY34vB827FJDUnOxhhhi85tObeunaSPAomy9aiYidSC1ARREnz
iz2VUntP/QnRnKVvL2nUZNzcz1xL5vfCRSKsRGRSv3qW1Y/1M71ylt6JVmSftWq+
VmMdFxFhdrb1OK/1ct/930Un/UP2NG9EJsWxote2XYlnVSZHzDqH7lUhbqgdCcLz
1m2tVNY=
=7wRd
-----END PGP SIGNATURE-----
Merge tag 'v5.16-rc5' into rdma.git for-next
Required due to dependencies in following patches.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently, when cma_resolve_ib_dev() searches for a matching GID it will
stop searching after encountering the first empty GID table entry. This
behavior is wrong since neither IB nor RoCE spec enforce tightly packed
GID tables.
For example, when the matching valid GID entry exists at index N, and if a
GID entry is empty at index N-1, cma_resolve_ib_dev() will fail to find
the matching valid entry.
Fix it by making cma_resolve_ib_dev() continue searching even after
encountering missing entries.
Fixes: f17df3b0de ("RDMA/cma: Add support for AF_IB to rdma_resolve_addr()")
Link: https://lore.kernel.org/r/b7346307e3bb396c43d67d924348c6c496493991.1639055490.git.leonro@nvidia.com
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Currently, ib_find_gid() will stop searching after encountering the first
empty GID table entry. This behavior is wrong since neither IB nor RoCE
spec enforce tightly packed GID tables.
For example, when a valid GID entry exists at index N, and if a GID entry
is empty at index N-1, ib_find_gid() will fail to find the valid entry.
Fix it by making ib_find_gid() continue searching even after encountering
missing entries.
Fixes: 5eb620c81c ("IB/core: Add helpers for uncached GID and P_Key searches")
Link: https://lore.kernel.org/r/e55d331b96cecfc2cf19803d16e7109ea966882d.1639055490.git.leonro@nvidia.com
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Modify rdma_query_gid() to return -ENOENT for empty entries. This will
make error reporting more accurate and will be used in next patches.
Link: https://lore.kernel.org/r/1f2b65dfb4d995e74b621e3e21e7c7445d187956.1639055490.git.leonro@nvidia.com
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The struct member variable create_flags is assigned twice. Remove the
unnecessary assignment.
Fixes: ece9ca97cc ("RDMA/uverbs: Do not check the input length on create_cq/qp paths")
Link: https://lore.kernel.org/r/20211207064607.541695-1-yanjun.zhu@linux.dev
Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The existing tests are a little hard to comprehend. Use
check_add_overflow() instead.
Fixes: 04ded16724 ("RDMA/cma: Verify private data length")
Link: https://lore.kernel.org/r/1637661978-18770-1-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Preset both receive and send CQ pointers prior to call to the drivers and
overwrite it later again till the mlx4 is going to be changed do not
overwrite ibqp properties.
This change is needed for mlx5, because in case of QP creation failure, it
will go to the path of QP destroy which relies on proper CQ pointers.
BUG: KASAN: use-after-free in create_qp.cold+0x164/0x16e [mlx5_ib]
Write of size 8 at addr ffff8880064c55c0 by task a.out/246
CPU: 0 PID: 246 Comm: a.out Not tainted 5.15.0+ #291
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack_lvl+0x45/0x59
print_address_description.constprop.0+0x1f/0x140
kasan_report.cold+0x83/0xdf
create_qp.cold+0x164/0x16e [mlx5_ib]
mlx5_ib_create_qp+0x358/0x28a0 [mlx5_ib]
create_qp.part.0+0x45b/0x6a0 [ib_core]
ib_create_qp_user+0x97/0x150 [ib_core]
ib_uverbs_handler_UVERBS_METHOD_QP_CREATE+0x92c/0x1250 [ib_uverbs]
ib_uverbs_cmd_verbs+0x1c38/0x3150 [ib_uverbs]
ib_uverbs_ioctl+0x169/0x260 [ib_uverbs]
__x64_sys_ioctl+0x866/0x14d0
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Allocated by task 246:
kasan_save_stack+0x1b/0x40
__kasan_kmalloc+0xa4/0xd0
create_qp.part.0+0x92/0x6a0 [ib_core]
ib_create_qp_user+0x97/0x150 [ib_core]
ib_uverbs_handler_UVERBS_METHOD_QP_CREATE+0x92c/0x1250 [ib_uverbs]
ib_uverbs_cmd_verbs+0x1c38/0x3150 [ib_uverbs]
ib_uverbs_ioctl+0x169/0x260 [ib_uverbs]
__x64_sys_ioctl+0x866/0x14d0
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 246:
kasan_save_stack+0x1b/0x40
kasan_set_track+0x1c/0x30
kasan_set_free_info+0x20/0x30
__kasan_slab_free+0x10c/0x150
slab_free_freelist_hook+0xb4/0x1b0
kfree+0xe7/0x2a0
create_qp.part.0+0x52b/0x6a0 [ib_core]
ib_create_qp_user+0x97/0x150 [ib_core]
ib_uverbs_handler_UVERBS_METHOD_QP_CREATE+0x92c/0x1250 [ib_uverbs]
ib_uverbs_cmd_verbs+0x1c38/0x3150 [ib_uverbs]
ib_uverbs_ioctl+0x169/0x260 [ib_uverbs]
__x64_sys_ioctl+0x866/0x14d0
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 514aee660d ("RDMA: Globally allocate and release QP memory")
Link: https://lore.kernel.org/r/2dbb2e2cbb1efb188a500e5634be1d71956424ce.1636631035.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Here is the big set of char and misc and other tiny driver subsystem
updates for 5.16-rc1.
Loads of things in here, all of which have been in linux-next for a
while with no reported problems (except for one called out below.)
Included are:
- habanana labs driver updates, including dma_buf usage,
reviewed and acked by the dma_buf maintainers
- iio driver update (going through this tree not staging as they
really do not belong going through that tree anymore)
- counter driver updates
- hwmon driver updates that the counter drivers needed, acked by
the hwmon maintainer
- xillybus driver updates
- binder driver updates
- extcon driver updates
- dma_buf module namespaces added (will cause a build error in
arm64 for allmodconfig, but that change is on its way through
the drm tree)
- lkdtm driver updates
- pvpanic driver updates
- phy driver updates
- virt acrn and nitr_enclaves driver updates
- smaller char and misc driver updates
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYYPX2A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymUUgCbB4EKysgLuXYdjUalZDx+vvZO4k0AniS14O4k
F+2dVSZ5WX6wumUzCaA6
=bXQM
-----END PGP SIGNATURE-----
Merge tag 'char-misc-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the big set of char and misc and other tiny driver subsystem
updates for 5.16-rc1.
Loads of things in here, all of which have been in linux-next for a
while with no reported problems (except for one called out below.)
Included are:
- habanana labs driver updates, including dma_buf usage, reviewed and
acked by the dma_buf maintainers
- iio driver update (going through this tree not staging as they
really do not belong going through that tree anymore)
- counter driver updates
- hwmon driver updates that the counter drivers needed, acked by the
hwmon maintainer
- xillybus driver updates
- binder driver updates
- extcon driver updates
- dma_buf module namespaces added (will cause a build error in arm64
for allmodconfig, but that change is on its way through the drm
tree)
- lkdtm driver updates
- pvpanic driver updates
- phy driver updates
- virt acrn and nitr_enclaves driver updates
- smaller char and misc driver updates"
* tag 'char-misc-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (386 commits)
comedi: dt9812: fix DMA buffers on stack
comedi: ni_usb6501: fix NULL-deref in command paths
arm64: errata: Enable TRBE workaround for write to out-of-range address
arm64: errata: Enable workaround for TRBE overwrite in FILL mode
coresight: trbe: Work around write to out of range
coresight: trbe: Make sure we have enough space
coresight: trbe: Add a helper to determine the minimum buffer size
coresight: trbe: Workaround TRBE errata overwrite in FILL mode
coresight: trbe: Add infrastructure for Errata handling
coresight: trbe: Allow driver to choose a different alignment
coresight: trbe: Decouple buffer base from the hardware base
coresight: trbe: Add a helper to pad a given buffer area
coresight: trbe: Add a helper to calculate the trace generated
coresight: trbe: Defer the probe on offline CPUs
coresight: trbe: Fix incorrect access of the sink specific data
coresight: etm4x: Add ETM PID for Kryo-5XX
coresight: trbe: Prohibit trace before disabling TRBE
coresight: trbe: End the AUX handle on truncation
coresight: trbe: Do not truncate buffer on IRQ
coresight: trbe: Fix handling of spurious interrupts
...
If the driver returns a new MR during rereg it has to fill it with the
IOVA from the proper source. If IB_MR_REREG_TRANS is set then the IOVA is
cmd.hca_va, otherwise the IOVA comes from the old MR. mlx5 for example has
two calls inside rereg_mr:
return create_real_mr(new_pd, umem, mr->ibmr.iova,
new_access_flags);
and
return create_real_mr(new_pd, new_umem, iova, new_access_flags);
Unconditionally overwriting the iova in the newly allocated MR will
corrupt the iova if the first path is used.
Remove the redundant initializations from ib_uverbs_rereg_mr().
Fixes: 6e0954b11c ("RDMA/uverbs: Allow drivers to create a new HW object during rereg_mr")
Link: https://lore.kernel.org/r/4b0a31bbc372842613286a10d7a8cbb0ee6069c7.1635400472.git.leonro@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmF/AjYeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG1hkIAJ6sFDbvb4M4LMwf
Slh2NVL9o5sLMBDzVwnVlyMSKDbMn1WBKreGssaLgZjGDc74lxsdSmw5l9MZm0JN
xlq95Q6XFiuu+0qDHPWwfDz3JFO4TqW2ZLLPWk9NnkNbRXqccSrlVRi1RpgE1t3/
NUtS8CQLu6A2BYMc6mkk3aV6IwSNKOkWbM5eBHSvU4j8B6lLbNQop0AfO/wyY1xB
U6LiVE1RpN/b7Yv+75ITtNzuHzVIBx6305FvSnOlKbMKKvIClt96Vd2OeuoEkK+6
wGU8JraB1+fc0GckAhynNrjWQWdvi0MAhFWWEJxjS20OGcV1rXDduNfkVNauO1Zn
+dNyJ3s=
=g9fz
-----END PGP SIGNATURE-----
Merge tag 'v5.15' into rdma.git for-next
Pull in the accepted for-rc patches as the next merge needs a newer base.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
alloc_and_bind() creates a new rdma_hw_stats structure but misses
initializing the mutex lock.
This causes debug kernel failures:
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 4 PID: 64464 at kernel/locking/mutex.c:575 __mutex_lock+0x9c3/0x12b0
Call Trace:
fill_res_counter_entry+0x6ee/0x1020 [ib_core]
res_get_common_dumpit+0x907/0x10a0 [ib_core]
nldev_stat_get_dumpit+0x20a/0x290 [ib_core]
netlink_dump+0x451/0x1040
__netlink_dump_start+0x583/0x830
rdma_nl_rcv_msg+0x3f3/0x7c0 [ib_core]
rdma_nl_rcv+0x264/0x410 [ib_core]
netlink_unicast+0x433/0x700
netlink_sendmsg+0x707/0xbf0
sock_sendmsg+0xb0/0xe0
__sys_sendto+0x193/0x240
__x64_sys_sendto+0xdd/0x1b0
do_syscall_64+0x3d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Instead of requiring all users to open code initialization of the lock put
it in the general rdma_alloc_hw_stats_struct() function and remove
duplicates.
Fixes: c4ffee7c9b ("RDMA/netlink: Implement counter dumpit calback")
Link: https://lore.kernel.org/r/4a22986c4685058d2c735d91703ee7d865815bb9.1635237668.git.leonro@nvidia.com
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Introduce ib_umem_dmabuf_get_pinned() which allows the driver to get a
dmabuf umem which is pinned and does not require move_notify callback
implementation.
The returned umem is pinned and DMA mapped like standard cpu umems, and is
released through ib_umem_release() (incl. unpinning and unmapping).
Link: https://lore.kernel.org/r/20211012120903.96933-3-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In order to better track where in the kernel the dma-buf code is used,
put the symbols in the namespace DMA_BUF and modify all users of the
symbols to properly import the namespace to not break the build at the
same time.
Now the output of modinfo shows the use of these symbols, making it
easier to watch for users over time:
$ modinfo drivers/misc/fastrpc.ko | grep import
import_ns: DMA_BUF
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: dri-devel@lists.freedesktop.org
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sumit Semwal <sumit.semwal@linaro.org>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20211010124628.17691-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The 'struct attribute' flex array contains some struct lock_class_key's
which become big when lockdep is turned on. Big enough that some drivers
will not load when CONFIG_PROVE_LOCKING=y because they cannot allocate
enough memory:
WARNING: CPU: 36 PID: 8 at mm/page_alloc.c:5350 __alloc_pages+0x27e/0x3e0
Call Trace:
kmalloc_order+0x2a/0xb0
kmalloc_order_trace+0x19/0xf0
__kmalloc+0x231/0x270
ib_setup_port_attrs+0xd8/0x870 [ib_core]
ib_register_device+0x419/0x4e0 [ib_core]
bnxt_re_task+0x208/0x2d0 [bnxt_re]
Link: https://lore.kernel.org/r/20211019002656.17745-1-wangyugui@e16-tech.com
Signed-off-by: wangyugui <wangyugui@e16-tech.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
'destroy_workqueue()' already drains the queue before destroying it, so
there is no need to flush it explicitly.
Remove the redundant 'flush_workqueue()' calls.
This was generated with coccinelle:
@@
expression E;
@@
- flush_workqueue(E);
destroy_workqueue(E);
Link: https://lore.kernel.org/r/ca7bac6e6c9c5cc8d04eec3944edb13de0e381a3.1633874776.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The pointer err_str is being initialized with a value that is never read,
it is being updated later on. The assignment is redundant and can be
removed.
Link: https://lore.kernel.org/r/20211007173942.21933-1-colin.king@canonical.com
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Provide an option to allow users to enable/disable optional counters
through RDMA netlink. Limiting it to users with ADMIN capability only.
Examples:
1. Enable optional counters cc_rx_ce_pkts and cc_rx_cnp_pkts (and
disable all others):
$ sudo rdma statistic set link rocep8s0f0/1 optional-counters \
cc_rx_ce_pkts,cc_rx_cnp_pkts
2. Remove all optional counters:
$ sudo rdma statistic unset link rocep8s0f0/1 optional-counters
Link: https://lore.kernel.org/r/20211008122439.166063-10-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In order to allow expansion of the set command with more set options, take
the set mode out of the main set function.
Link: https://lore.kernel.org/r/20211008122439.166063-9-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This patch adds the ability to get the name, index and status of all
counters for each link through RDMA netlink. This can be used for
user-space to get the current optional-counter mode.
Examples:
$ rdma statistic mode
link rocep8s0f0/1 optional-counters cc_rx_ce_pkts
$ rdma statistic mode supported
link rocep8s0f0/1 supported optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts,cc_tx_cnp_pkts
link rocep8s0f1/1 supported optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts,cc_tx_cnp_pkts
Link: https://lore.kernel.org/r/20211008122439.166063-8-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
An optional counter is a driver-specific counter that may be dynamically
enabled/disabled. This enhancement allows drivers to expose counters
which are, for example, mutually exclusive and cannot be enabled at the
same time, counters that might degrades performance, optional debug
counters, etc.
Optional counters are marked with IB_STAT_FLAG_OPTIONAL flag. They are not
exported in sysfs, and must be at the end of all stats, otherwise the
attr->show() in sysfs would get wrong indexes for hwcounters that are
behind optional counters.
Link: https://lore.kernel.org/r/20211008122439.166063-7-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a bitmap in rdma_hw_stat structure, with each bit indicates whether
the corresponding counter is currently disabled or not. By default
hwcounters are enabled.
Link: https://lore.kernel.org/r/20211008122439.166063-6-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a new API rdma_free_hw_stats_struct to pair with
rdma_alloc_hw_stats_struct (which is also de-inlined).
This will be useful when there are more alloc/free works in following
patches.
Link: https://lore.kernel.org/r/20211008122439.166063-5-markzhang@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a counter statistic descriptor structure in rdma_hw_stats. In addition
to the counter name, more meta-information will be added. This code
extension is needed for optional-counter support in the following patches.
Link: https://lore.kernel.org/r/20211008122439.166063-4-markzhang@nvidia.com
Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There are a couple of subtle error path bugs related to mapping the sgls:
- In rdma_rw_ctx_init(), dma_unmap would be called with an sg that could
have been incremented from the original call, as well as an nents that
is the dma mapped entries not the original number of nents called when
mapped.
- Similarly in rdma_rw_ctx_signature_init, both sg and prot_sg were
unmapped with the incorrect number of nents.
To fix this, switch to the sgtable interface for mapping which
conveniently stores the original nents for unmapping. This will get
cleaned up further once the dma mapping interface supports P2PDMA and
pci_p2pdma_map_sg() can be removed.
Fixes: 0e353e34e1 ("IB/core: add RW API support for signature MRs")
Fixes: a060b5629a ("IB/core: generic RDMA READ/WRITE API")
Link: https://lore.kernel.org/r/20211001213215.3761-1-logang@deltatee.com
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Two list heads in the rdma_id_private are being used for multiple
purposes, to save a few bytes of memory. Give the different purposes
different names and union the memory that is clearly exclusive.
list splits into device_item and listen_any_item. device_item is threaded
onto the cma_device's list and listen_any goes onto the
listen_any_list. IDs doing any listen cannot have devices.
listen_list splits into listen_item and listen_list. listen_list is on the
parent listen any rdma_id_private and listen_item is on child listen that
is bound to a specific cma_dev.
Which name should be used in which case depends on the state and other
factors of the rdma_id_private. Remap all the confusing references to make
sense with the new names, so at least there is some hope of matching the
necessary preconditions with each access.
Link: https://lore.kernel.org/r/0-v1-a5ead4a0c19d+c3a-cma_list_head_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The FSM can run in a circle allowing rdma_resolve_ip() to be called twice
on the same id_priv. While this cannot happen without going through the
work, it violates the invariant that the same address resolution
background request cannot be active twice.
CPU 1 CPU 2
rdma_resolve_addr():
RDMA_CM_IDLE -> RDMA_CM_ADDR_QUERY
rdma_resolve_ip(addr_handler) #1
process_one_req(): for #1
addr_handler():
RDMA_CM_ADDR_QUERY -> RDMA_CM_ADDR_BOUND
mutex_unlock(&id_priv->handler_mutex);
[.. handler still running ..]
rdma_resolve_addr():
RDMA_CM_ADDR_BOUND -> RDMA_CM_ADDR_QUERY
rdma_resolve_ip(addr_handler)
!! two requests are now on the req_list
rdma_destroy_id():
destroy_id_handler_unlock():
_destroy_id():
cma_cancel_operation():
rdma_addr_cancel()
// process_one_req() self removes it
spin_lock_bh(&lock);
cancel_delayed_work(&req->work);
if (!list_empty(&req->list)) == true
! rdma_addr_cancel() returns after process_on_req #1 is done
kfree(id_priv)
process_one_req(): for #2
addr_handler():
mutex_lock(&id_priv->handler_mutex);
!! Use after free on id_priv
rdma_addr_cancel() expects there to be one req on the list and only
cancels the first one. The self-removal behavior of the work only happens
after the handler has returned. This yields a situations where the
req_list can have two reqs for the same "handle" but rdma_addr_cancel()
only cancels the first one.
The second req remains active beyond rdma_destroy_id() and will
use-after-free id_priv once it inevitably triggers.
Fix this by remembering if the id_priv has called rdma_resolve_ip() and
always cancel before calling it again. This ensures the req_list never
gets more than one item in it and doesn't cost anything in the normal flow
that never uses this strange error path.
Link: https://lore.kernel.org/r/0-v1-3bc675b8006d+22-syz_cancel_uaf_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: e51060f08a ("IB: IP address based RDMA connection manager")
Reported-by: syzbot+dc3dfba010d7671e05f5@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If the state is not idle then rdma_bind_addr() will immediately fail and
no change to global state should happen.
For instance if the state is already RDMA_CM_LISTEN then this will corrupt
the src_addr and would cause the test in cma_cancel_operation():
if (cma_any_addr(cma_src_addr(id_priv)) && !id_priv->cma_dev)
To view a mangled src_addr, eg with a IPv6 loopback address but an IPv4
family, failing the test.
This would manifest as this trace from syzkaller:
BUG: KASAN: use-after-free in __list_add_valid+0x93/0xa0 lib/list_debug.c:26
Read of size 8 at addr ffff8881546491e0 by task syz-executor.1/32204
CPU: 1 PID: 32204 Comm: syz-executor.1 Not tainted 5.12.0-rc8-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack+0x141/0x1d7 lib/dump_stack.c:120
print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
__kasan_report mm/kasan/report.c:399 [inline]
kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
__list_add_valid+0x93/0xa0 lib/list_debug.c:26
__list_add include/linux/list.h:67 [inline]
list_add_tail include/linux/list.h:100 [inline]
cma_listen_on_all drivers/infiniband/core/cma.c:2557 [inline]
rdma_listen+0x787/0xe00 drivers/infiniband/core/cma.c:3751
ucma_listen+0x16a/0x210 drivers/infiniband/core/ucma.c:1102
ucma_write+0x259/0x350 drivers/infiniband/core/ucma.c:1732
vfs_write+0x28e/0xa30 fs/read_write.c:603
ksys_write+0x1ee/0x250 fs/read_write.c:658
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xae
Which is indicating that an rdma_id_private was destroyed without doing
cma_cancel_listens().
Instead of trying to re-use the src_addr memory to indirectly create an
any address build one explicitly on the stack and bind to that as any
other normal flow would do.
Link: https://lore.kernel.org/r/0-v1-9fbb33f5e201+2a-cma_listen_jgg@nvidia.com
Cc: stable@vger.kernel.org
Fixes: 732d41c545 ("RDMA/cma: Make the locking for automatic state transition more clear")
Reported-by: syzbot+6bb0528b13611047209c@syzkaller.appspotmail.com
Tested-by: Hao Sun <sunhao.th@gmail.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
If cma_listen_on_all() fails it leaves the per-device ID still on the
listen_list but the state is not set to RDMA_CM_ADDR_BOUND.
When the cmid is eventually destroyed cma_cancel_listens() is not called
due to the wrong state, however the per-device IDs are still holding the
refcount preventing the ID from being destroyed, thus deadlocking:
task:rping state:D stack: 0 pid:19605 ppid: 47036 flags:0x00000084
Call Trace:
__schedule+0x29a/0x780
? free_unref_page_commit+0x9b/0x110
schedule+0x3c/0xa0
schedule_timeout+0x215/0x2b0
? __flush_work+0x19e/0x1e0
wait_for_completion+0x8d/0xf0
_destroy_id+0x144/0x210 [rdma_cm]
ucma_close_id+0x2b/0x40 [rdma_ucm]
__destroy_id+0x93/0x2c0 [rdma_ucm]
? __xa_erase+0x4a/0xa0
ucma_destroy_id+0x9a/0x120 [rdma_ucm]
ucma_write+0xb8/0x130 [rdma_ucm]
vfs_write+0xb4/0x250
ksys_write+0xb5/0xd0
? syscall_trace_enter.isra.19+0x123/0x190
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Ensure that cma_listen_on_all() atomically unwinds its action under the
lock during error.
Fixes: c80a0c52d8 ("RDMA/cma: Add missing error handling of listen_id")
Link: https://lore.kernel.org/r/20210913093344.17230-1-thomas.liu@ucloud.cn
Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ROCE uses IGMP for Multicast instead of the native Infiniband system where
joins are required in order to post messages on the Multicast group. On
Ethernet one can send Multicast messages to arbitrary addresses without
the need to subscribe to a group.
So ROCE correctly does not send IGMP joins during rdma_join_multicast().
F.e. in cma_iboe_join_multicast() we see:
if (addr->sa_family == AF_INET) {
if (gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP) {
ib.rec.hop_limit = IPV6_DEFAULT_HOPLIMIT;
if (!send_only) {
err = cma_igmp_send(ndev, &ib.rec.mgid,
true);
}
}
} else {
So the IGMP join is suppressed as it is unnecessary.
However no such check is done in destroy_mc(). And therefore leaving a
sendonly multicast group will send an IGMP leave.
This means that the following scenario can lead to a multicast receiver
unexpectedly being unsubscribed from a MC group:
1. Sender thread does a sendonly join on MC group X. No IGMP join
is sent.
2. Receiver thread does a regular join on the same MC Group x.
IGMP join is sent and the receiver begins to get messages.
3. Sender thread terminates and destroys MC group X.
IGMP leave is sent and the receiver no longer receives data.
This patch adds the same logic for sendonly joins to destroy_mc() that is
also used in cma_iboe_join_multicast().
Fixes: ab15c95a17 ("IB/core: Support for CMA multicast join flags")
Link: https://lore.kernel.org/r/alpine.DEB.2.22.394.2109081340540.668072@gentwo.de
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
- Various cleanup and small features for rtrs
- kmap_local_page() conversions
- Driver updates and fixes for: efa, rxe, mlx5, hfi1, qed, hns
- Cache the IB subnet prefix
- Rework how CRC is calcuated in rxe
- Clean reference counting in iwpm's netlink
- Pull object allocation and lifecycle for user QPs to the uverbs core
code
- Several small hns features and continued general code cleanups
- Fix the scatterlist confusion of orig_nents/nents introduced in an
earlier patch creating the append operation
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAmEudRgACgkQOG33FX4g
mxraJA//c6bMxrrTVrzmrtrkyYD4tYWE8RDfgvoyZtleZnnEOJeunCQWakQrpJSv
ukSnOGCA3PtnmRMdV54f/11YJ/7otxOJodSO7jWsIoBrqG/lISAdX8mn2iHhrvJ0
dIaFEFPLy0WqoMLCJVIYIupR0IStVHb/mWx0uYL4XnnoYKyt7f7K5JMZpNWMhDN2
ieJw0jfrvEYm8pipWuxUvB16XARlzAWQrjqLpMRI+jFRpbDVBY21dz2/LJvOJPrA
LcQ+XXsV/F659ibOAGm6bU4BMda8fE6Lw90B/gmhSswJ205NrdziF5cNYHP0QxcN
oMjrjSWWHc9GEE7MTipC2AH8e36qob16Q7CK+zHEJ+ds7R6/O/8XmED1L8/KFpNA
FGqnjxnxsl1y27mUegfj1Hh8PfoDp2oVq0lmpEw0CYo4cfVzHSMRrbTR//XmW628
Ie/mJddpFK4oLk+QkSNjSLrnxOvdTkdA58PU0i84S5eUVMNm41jJDkxg2J7vp0Zn
sclZsclhUQ9oJ5Q2so81JMWxu4JDn7IByXL0ULBaa6xwQTiVEnyvSxSuPlflhLRW
0vI2ylATYKyWkQqyX7VyWecZJzwhwZj5gMMWmoGsij8bkZhQ/VaQMaesByzSth+h
NV5UAYax4GqyOQ/tg/tqT6e5nrI1zof87H64XdTCBpJ7kFyQ/oA=
=ZwOe
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"This is quite a small cycle, no major series stands out. The HNS and
rxe drivers saw the most activity this cycle, with rxe being broken
for a good chunk of time. The significant deleted line count is due to
a SPDX cleanup series.
Summary:
- Various cleanup and small features for rtrs
- kmap_local_page() conversions
- Driver updates and fixes for: efa, rxe, mlx5, hfi1, qed, hns
- Cache the IB subnet prefix
- Rework how CRC is calcuated in rxe
- Clean reference counting in iwpm's netlink
- Pull object allocation and lifecycle for user QPs to the uverbs
core code
- Several small hns features and continued general code cleanups
- Fix the scatterlist confusion of orig_nents/nents introduced in an
earlier patch creating the append operation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (90 commits)
RDMA/mlx5: Relax DCS QP creation checks
RDMA/hns: Delete unnecessary blank lines.
RDMA/hns: Encapsulate the qp db as a function
RDMA/hns: Adjust the order in which irq are requested and enabled
RDMA/hns: Remove RST2RST error prints for hw v1
RDMA/hns: Remove dqpn filling when modify qp from Init to Init
RDMA/hns: Fix QP's resp incomplete assignment
RDMA/hns: Fix query destination qpn
RDMA/hfi1: Convert to SPDX identifier
IB/rdmavt: Convert to SPDX identifier
RDMA/hns: Bugfix for incorrect association between dip_idx and dgid
RDMA/hns: Bugfix for the missing assignment for dip_idx
RDMA/hns: Bugfix for data type of dip_idx
RDMA/hns: Fix incorrect lsn field
RDMA/irdma: Remove the repeated declaration
RDMA/core/sa_query: Retry SA queries
RDMA: Use the sg_table directly and remove the opencoded version from umem
lib/scatterlist: Fix wrong update of orig_nents
lib/scatterlist: Provide a dedicated function to support table append
RDMA/hns: Delete unused hns bitmap interface
...
From Maor Gottlieb
====================
Fix the use of nents and orig_nents in the sg table append helpers. The
nents should be used by the DMA layer to store the number of DMA mapped
sges, the orig_nents is the number of CPU sges.
Since the sg append logic doesn't always create a SGL with exactly
orig_nents entries store a total_nents as well to allow the table to be
properly free'd and reorganize the freeing logic to share across all the
use cases.
====================
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
* 'sg_nents':
RDMA: Use the sg_table directly and remove the opencoded version from umem
lib/scatterlist: Fix wrong update of orig_nents
lib/scatterlist: Provide a dedicated function to support table append
A MAD packet is sent as an unreliable datagram (UD). SA requests are sent
as MAD packets. As such, SA requests or responses may be silently dropped.
IB Core's MAD layer has a timeout and retry mechanism, which amongst
other, is used by RDMA CM. But it is not used by SA queries. The lack of
retries of SA queries leads to long specified timeout, and error being
returned in case of packet loss. The ULP or user-land process has to
perform the retry.
Fix this by taking advantage of the MAD layer's retry mechanism.
First, a check against a zero timeout is added in rdma_resolve_route(). In
send_mad(), we set the MAD layer timeout to one tenth of the specified
timeout and the number of retries to 10. The special case when timeout is
less than 10 is handled.
With this fix:
# ucmatose -c 1000 -S 1024 -C 1
runs stable on an Infiniband fabric. Without this fix, we see an
intermittent behavior and it errors out with:
cmatose: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -110
(110 is ETIMEDOUT)
Link: https://lore.kernel.org/r/1628784755-28316-1-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This allows using the normal sg_table APIs and makes all the code
cleaner. Remove sgt, nents and nmapd from ib_umem.
Link: https://lore.kernel.org/r/20210824142531.3877007-4-maorg@nvidia.com
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
orig_nents should represent the number of entries with pages,
but __sg_alloc_table_from_pages sets orig_nents as the number of
total entries in the table. This is wrong when the API is used for
dynamic allocation where not all the table entries are mapped with
pages. It wasn't observed until now, since RDMA umem who uses this
API in the dynamic form doesn't use orig_nents implicit or explicit
by the scatterlist APIs.
Fix it by changing the append API to track the SG append table
state and have an API to free the append table according to the
total number of entries in the table.
Now all APIs set orig_nents as number of enries with pages.
Fixes: 07da1223ec ("lib/scatterlist: Add support in dynamic allocation of SG table from pages")
Link: https://lore.kernel.org/r/20210824142531.3877007-3-maorg@nvidia.com
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
RDMA is the only in-kernel user that uses __sg_alloc_table_from_pages to
append pages dynamically. In the next patch. That mode will be extended
and that function will get more parameters. So separate it into a unique
function to make such change more clear.
Link: https://lore.kernel.org/r/20210824142531.3877007-2-maorg@nvidia.com
Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
ib_sa_service_rec_query() was introduced in kernel v2.6.13 by
commit cbae32c563 ("[PATCH] IB: Add Service Record support to SA client")
in 2005. It was not used then and have never been used since.
Removing it and related functions/structs.
Link: https://lore.kernel.org/r/1628702736-12651-1-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The dmabuf memory registrations are missing the restrack handling and
hence do not appear in rdma tool.
Fixes: bfe0cc6eb2 ("RDMA/uverbs: Add uverbs command for dma-buf based MR registration")
Link: https://lore.kernel.org/r/20210812135607.6228-1-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The QP usecnts were incremented through QP attributes structure while
decreased through QP itself. Rely on the ib_creat_qp_user() code that
initialized all QP parameters prior returning to the user and increment
exactly like destroy does.
Link: https://lore.kernel.org/r/25d256a3bb1fc480b77d7fe439817b993de48610.1628014762.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
All QP creation flows called ib_create_qp_security(), but differently.
This caused to the need to provide exclusion conditions for the XRC_TGT,
because such QP already had selinux configuration call.
In order to fix it, move ib_create_qp_security() to the general QP
creation routine.
Link: https://lore.kernel.org/r/4d7cd6f5828aca37fb62283e6b126b73ab86b18c.1628014762.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The low-level create QP function grew to be larger than any sensible
inline function should be. The inline attribute is not really needed for
that function and can be implemented as exported symbol.
Link: https://lore.kernel.org/r/2c08709d86f876c3dfb77684357b2a939e570ca4.1628014762.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The ib_create_named_qp() is kernel verb that is not used for user supplied
attributes. In such case, it is ULP responsibility to provide valid QP
attributes.
In-kernel API shouldn't check it, exactly like other functions that don't
check device capabilities.
Link: https://lore.kernel.org/r/b9b9e981d1af148b750750196e686199dbbf61f8.1628014762.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>