This relied on the probe function only being invoked for the bus type the
mock was registered on. The removal of the bus ops broke this assumption,
so the probe could be called on non-mock bus types like PCI.
Check the bus type directly in probe.
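A minimal userspace sketch of the check, assuming hypothetical stand-in
types (the struct layouts, the bus names and mock_probe() below are
illustrative and are not the kernel's definitions):

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's bus_type and device structures. */
struct bus_type {
	const char *name;
};

struct device {
	const struct bus_type *bus;
};

static const struct bus_type mock_bus = { .name = "iommufd_mock" };
static const struct bus_type pci_bus = { .name = "pci" };

/* Refuse any device that does not sit on the mock bus. */
static int mock_probe(const struct device *dev)
{
	if (dev->bus != &mock_bus)
		return -ENODEV;
	return 0;
}
```

The point is that the probe itself rejects foreign devices instead of
trusting the caller to only pass devices from its own bus.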
Fixes: 17de3f5fdd ("iommu: Retire bus ops")
Link: https://lore.kernel.org/r/0-v1-82d59f7eab8c+40c-iommufd_mock_bus_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allow testing whether the IOTLB has been invalidated or not.
Link: https://lore.kernel.org/r/20240111041015.47920-6-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add mock_domain_cache_invalidate_user() and its data structure so that the
userspace selftest program can cover the user cache invalidation pathway.
Link: https://lore.kernel.org/r/20240111041015.47920-5-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In nested translation, the stage-1 page table is user-managed but cached
by the IOMMU hardware, so an update on present page table entries in the
stage-1 page table should be followed with a cache invalidation.
Add an IOMMU_HWPT_INVALIDATE ioctl to support such a cache invalidation.
It takes hwpt_id to specify the iommu_domain, and a multi-entry array to
support multiple invalidation data in one ioctl.
enum iommu_hwpt_invalidate_data_type is defined to tag the data type of
the entries in the multi-entry array.
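The multi-entry idea can be sketched in userspace as follows. This is a
simplified model and not the real uAPI layout: struct inv_request, struct
inv_entry, NUM_IOTLB and invalidate_all() are hypothetical names, and the
convention of writing the number of handled entries back into entry_num is
an assumption of this sketch.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified request: not the real uAPI structure. */
struct inv_request {
	const void *data;	/* start of the entry array */
	uint32_t entry_len;	/* size of one entry in bytes */
	uint32_t entry_num;	/* in: entries supplied, out: entries handled */
};

/* Hypothetical per-entry payload used only by this sketch. */
struct inv_entry {
	uint32_t iotlb_id;
};

#define NUM_IOTLB 4u

/* Walk the array, stop at the first bad entry and report the progress. */
static int invalidate_all(uint32_t *iotlb, struct inv_request *req)
{
	const unsigned char *cur = req->data;
	uint32_t done = 0;
	int rc = 0;

	if (req->entry_len < sizeof(struct inv_entry)) {
		req->entry_num = 0;
		return -EINVAL;
	}

	for (; done < req->entry_num; done++, cur += req->entry_len) {
		struct inv_entry e;

		memcpy(&e, cur, sizeof(e));
		if (e.iotlb_id >= NUM_IOTLB) {
			rc = -EINVAL;
			break;
		}
		iotlb[e.iotlb_id] = 0;	/* model "invalidated" as zero */
	}
	req->entry_num = done;
	return rc;
}
```

Carrying an explicit entry_len lets the walker step through the array
without knowing the full layout of each driver-specific entry.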
Link: https://lore.kernel.org/r/20240111041015.47920-3-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The mixture of kernel and user space lifecycle objects continues to be
complicated inside iommufd. The obj->destroy_rwsem is used to bring order
to the kernel driver destruction sequence but it cannot be sequenced right
with the other refcounts so we end up possibly UAF'ing:
BUG: KASAN: slab-use-after-free in __up_read+0x627/0x750 kernel/locking/rwsem.c:1342
Read of size 8 at addr ffff888073cde868 by task syz-executor934/6535
CPU: 1 PID: 6535 Comm: syz-executor934 Not tainted 6.6.0-rc7-syzkaller-00195-g2af9b20dbb39 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0xc4/0x620 mm/kasan/report.c:475
kasan_report+0xda/0x110 mm/kasan/report.c:588
__up_read+0x627/0x750 kernel/locking/rwsem.c:1342
iommufd_put_object drivers/iommu/iommufd/iommufd_private.h:149 [inline]
iommufd_vfio_ioas+0x46c/0x580 drivers/iommu/iommufd/vfio_compat.c:146
iommufd_fops_ioctl+0x347/0x4d0 drivers/iommu/iommufd/main.c:398
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl fs/ioctl.c:857 [inline]
__x64_sys_ioctl+0x18f/0x210 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
There are two races here, the more obvious one:
  CPU 0                                 CPU 1
  iommufd_put_object()
                                        iommufd_destroy()
   refcount_dec(&obj->users)
                                         iommufd_object_remove()
                                         kfree()
   up_read(&obj->destroy_rwsem) // Boom
And there is also perhaps some possibility that the rwsem could hit an
issue:
  CPU 0                                 CPU 1
  iommufd_put_object()
                                        iommufd_object_destroy_user()
   refcount_dec(&obj->users);
                                         down_write(&obj->destroy_rwsem)
   up_read(&obj->destroy_rwsem);
                                          atomic_long_or(RWSEM_FLAG_WAITERS, &sem->count);
     tmp = atomic_long_add_return_release()
                                          rwsem_try_write_lock()
                                        iommufd_object_remove()
                                         up_write(&obj->destroy_rwsem)
                                        kfree()
     clear_nonspinnable() // Boom
Fix this by reorganizing this again so that two refcounts are used to keep
track of things with a rule that users == 0 && shortterm_users == 0 means
no other threads have that memory. Put a wait_queue in the iommufd_ctx
object that is triggered when any sub object reaches a 0
shortterm_users. This allows the same wait for userspace ioctls to finish
behavior that the rwsem was providing.
This is weaker still than the prior versions:
- There is no bias on shortterm_users so if some thread is waiting to
destroy other threads can continue to get new read sides
- If destruction fails, eg because of an active in-kernel user, then
shortterm_users will have cycled to zero momentarily blocking new users
- If userspace races destroy with other userspace operations they
continue to get an EBUSY since we still can't intermix looking up an ID
and sleeping for its unref
In all cases these are things that userspace brings on itself, correct
programs will not hit them.
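The two-counter rule can be modelled in userspace with C11 atomics. This is
only a sketch: struct obj and the obj_*() helpers are hypothetical, the
counters start at zero rather than using the kernel's refcount_t
conventions, and the wait queue is reduced to a boolean "wake the waiters"
return value.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of an object with the two counters described above. */
struct obj {
	atomic_int users;		/* long-term references */
	atomic_int shortterm_users;	/* held only while an ioctl runs */
};

static void obj_init(struct obj *o)
{
	atomic_init(&o->users, 0);
	atomic_init(&o->shortterm_users, 0);
}

static void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->users, 1);
	atomic_fetch_add(&o->shortterm_users, 1);
}

/*
 * Drop both counts. The short-term count is dropped last, and the return
 * value is true when it reached zero, which is when a destroyer sleeping
 * on the wait queue would be woken.
 */
static bool obj_put(struct obj *o)
{
	atomic_fetch_sub(&o->users, 1);
	return atomic_fetch_sub(&o->shortterm_users, 1) == 1;
}

/* The rule from the text: both counts at zero means nobody else has it. */
static bool obj_can_free(struct obj *o)
{
	return atomic_load(&o->users) == 0 &&
	       atomic_load(&o->shortterm_users) == 0;
}
```

Unlike the rwsem, nothing in obj_put() touches the object after the final
decrement, so a destroyer that observes both counts at zero can free it.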
Fixes: 99f98a7c0d ("iommufd: IOMMUFD_DESTROY should not increase the refcount")
Link: https://lore.kernel.org/all/2-v2-ca9e00171c5b+123-iommufd_syz4_jgg@nvidia.com/
Reported-by: syzbot+d31adfb277377ef8fcba@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/00000000000055ef9a0609336580@google.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Before we can allow drivers to coexist, we need to make sure that one
driver's domain ops can't misinterpret another driver's dev_iommu_priv
data. To that end, add a token to the domain so we can remember how it
was allocated - for now this may as well be the device ops, since they
still correlate 1:1 with drivers. We can trust ourselves for internal
default domain attachment, so add checks to cover all the public attach
interfaces.
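A sketch of the ownership check, assuming simplified stand-in structures
(struct iommu_ops, struct device, struct domain and both helpers below are
illustrative models, not the kernel definitions):

```c
#include <errno.h>

/* Stand-in for a driver's ops table; only its address matters here. */
struct iommu_ops {
	const char *name;
};

/* The ops of the driver that manages this device. */
struct device {
	const struct iommu_ops *ops;
};

/* The token recorded when the domain was allocated. */
struct domain {
	const struct iommu_ops *owner;
};

static struct domain domain_alloc(const struct iommu_ops *ops)
{
	struct domain d = { .owner = ops };

	return d;
}

/* Reject attaching a device to a domain allocated by another driver. */
static int attach_device(const struct domain *d, const struct device *dev)
{
	if (dev->ops != d->owner)
		return -EINVAL;
	return 0;
}
```

Comparing pointers is enough as long as ops tables correlate 1:1 with
drivers, which is the assumption stated above.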
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/097c6f30480e4efe12195d00ba0e84ea4837fb4c.1700589539.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"Core changes:
- Make default-domains mandatory for all IOMMU drivers
- Remove group refcounting
- Add generic_single_device_group() helper and consolidate drivers
- Cleanup map/unmap ops
- Scaling improvements for the IOVA rcache depot
- Convert dart & iommufd to the new domain_alloc_paging()
ARM-SMMU:
- Device-tree binding update:
- Add qcom,sm7150-smmu-v2 for Adreno on SM7150 SoC
- SMMUv2:
- Support for Qualcomm SDM670 (MDSS) and SM7150 SoCs
- SMMUv3:
- Large refactoring of the context descriptor code to move the CD
table into the master, paving the way for '->set_dev_pasid()'
support on non-SVA domains
- Minor cleanups to the SVA code
Intel VT-d:
- Enable debugfs to dump domain attached to a pasid
- Remove an unnecessary inline function
AMD IOMMU:
- Initial patches for SVA support (not complete yet)
S390 IOMMU:
- DMA-API conversion and optimized IOTLB flushing
And some smaller fixes and improvements"
* tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (102 commits)
iommu/dart: Remove the force_bypass variable
iommu/dart: Call apple_dart_finalize_domain() as part of alloc_paging()
iommu/dart: Convert to domain_alloc_paging()
iommu/dart: Move the blocked domain support to a global static
iommu/dart: Use static global identity domains
iommufd: Convert to alloc_domain_paging()
iommu/vt-d: Use ops->blocked_domain
iommu/vt-d: Update the definition of the blocking domain
iommu: Move IOMMU_DOMAIN_BLOCKED global statics to ops->blocked_domain
Revert "iommu/vt-d: Remove unused function"
iommu/amd: Remove DMA_FQ type from domain allocation path
iommu: change iommu_map_sgtable to return signed values
iommu/virtio: Add __counted_by for struct viommu_request and use struct_size()
iommu/vt-d: debugfs: Support dumping a specified page table
iommu/vt-d: debugfs: Create/remove debugfs file per {device, pasid}
iommu/vt-d: debugfs: Dump entry pointing to huge page
iommu/vt-d: Remove unused function
iommu/arm-smmu-v3-sva: Remove bond refcount
iommu/arm-smmu-v3-sva: Remove unused iommu_sva handle
iommu/arm-smmu-v3: Rename cdcfg to cd_table
...
Patches in Joerg's iommu tree convert the mock driver to use
domain_alloc_paging(), and they clash badly with the way the selftest
changes for nesting were structured.
Massage the selftest so that it looks closer to the code after the
domain_alloc_paging() conversion to ease the merge. Change
__mock_domain_alloc_paging() into mock_domain_alloc_paging() in the same
way as the iommu tree. The merge resolution then trivially takes both and
deletes mock_domain_alloc().
Link: https://lore.kernel.org/r/0-v1-90a855762c96+19de-mock_merge_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
iommufd_test_dirty()/IOMMU_TEST_OP_DIRTY sets the dirty bits in the mock
domain implementation that the userspace side validates against what it
obtains via the UAPI.
However, when iommufd_test_dirty() was introduced it did not check for a
page_size of 0, leading to two possible divide-by-zero problems: one at the
beginning when calculating @max, and another while calculating the IOVA in
the XArray PFN tracking list.
While at it, validate the length to require non-zero value as well, as we
can't be allocating a 0-sized bitmap.
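The validation can be sketched as below. The helper name dirty_max_pages()
and the extra alignment check are assumptions made for this example; only
the non-zero checks on page_size and length come from the description
above.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Validate the inputs before any division happens, then compute how many
 * pages the range covers (the value called @max above).
 */
static int dirty_max_pages(uint64_t length, uint64_t page_size, uint64_t *max)
{
	if (!page_size || !length)
		return -EINVAL;
	/* Assumed here: the range must be a whole number of pages. */
	if (length % page_size)
		return -EINVAL;

	*max = length / page_size;
	return 0;
}
```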
Link: https://lore.kernel.org/r/20231030113446.7056-1-joao.m.martins@oracle.com
Reported-by: syzbot+25dc7383c30ecdc83c38@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-iommu/00000000000005f6aa0608b9220f@google.com/
Fixes: a9af47e382 ("iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_BITMAP")
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
We never initialize the two interval tree nodes, and zero fill is not the
same as RB_CLEAR_NODE. This can hide issues where we missed adding the
area to the trees. Factor out the allocation and clear the two nodes.
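Why zero fill differs from RB_CLEAR_NODE can be shown with a small userspace
model. struct node and the two helpers are simplified stand-ins that follow
the same idea as the kernel macros (a cleared node points at itself); they
are not the kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct rb_node. */
struct node {
	uintptr_t parent_color;
	struct node *right;
	struct node *left;
};

/* Same idea as RB_CLEAR_NODE(): make the node point at itself. */
static void node_clear(struct node *n)
{
	n->parent_color = (uintptr_t)n;
}

/* Same idea as RB_EMPTY_NODE(): self-pointing means "not in any tree". */
static bool node_is_empty(const struct node *n)
{
	return n->parent_color == (uintptr_t)n;
}
```

A zero-filled node has parent_color == 0, which never equals its own
address, so it does not look empty and a missed insertion goes unnoticed.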
Fixes: 51fe6141f0 ("iommufd: Data structure to provide IOVA to PFN mapping")
Link: https://lore.kernel.org/r/20231030145035.GG691768@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In iopt_area_split(), if the original iopt_area has filled a domain and is
linked to domains_itree, pages_nodes have to be properly
reinserted. Otherwise the domains_itree becomes corrupted and we will UAF.
Fixes: 51fe6141f0 ("iommufd: Data structure to provide IOVA to PFN mapping")
Link: https://lore.kernel.org/r/20231027162941.2864615-2-den@valinux.co.jp
Cc: stable@vger.kernel.org
Signed-off-by: Koichiro Den <den@valinux.co.jp>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Move the global static blocked domain to the ops and convert the unmanaged
domain to domain_alloc_paging.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Sven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/4-v2-bff223cf6409+282-dart_paging_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add nested domain support in the ->domain_alloc_user op with some proper
sanity checks. Then, add a domain_nested_ops for all nested domains and
split the get_md_pagetable helper into paging and nested helpers.
Also, add an iotlb as a testing property of a nested domain.
Link: https://lore.kernel.org/r/20231026043938.63898-10-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
IOMMU_HWPT_ALLOC already supports iommu_domain allocation for userspace.
But it can only allocate a hw_pagetable that associates to a given IOAS,
i.e. only a kernel-managed hw_pagetable of IOMMUFD_OBJ_HWPT_PAGING type.
IOMMU drivers can now support user-managed hw_pagetables, for two-stage
translation use cases that require user data input from the user space.
Add a new IOMMUFD_OBJ_HWPT_NESTED type with its abort/destroy(). Pair it
with a new iommufd_hwpt_nested structure and its to_hwpt_nested() helper.
Update the to_hwpt_paging() helper, so a NESTED-type hw_pagetable can be
handled in the callers, for example iommufd_hw_pagetable_enforce_rr().
Screen the inputs including the parent PAGING-type hw_pagetable that has
a need of a new nest_parent flag in the iommufd_hwpt_paging structure.
Extend the IOMMU_HWPT_ALLOC ioctl to accept an IOMMU driver specific data
input which is tagged by the enum iommu_hwpt_data_type. Also, update the
@pt_id to accept hwpt_id too besides an ioas_id. Then, use them to allocate
a hw_pagetable of IOMMUFD_OBJ_HWPT_NESTED type using the
iommufd_hw_pagetable_alloc_nested() allocator.
Link: https://lore.kernel.org/r/20231026043938.63898-8-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The domain_alloc_user op already accepts user flags for domain allocation.
Add a parent domain pointer and driver specific user data support as well.
The user data would be tagged with a type for iommu drivers to add their
own driver specific user data per hw_pagetable.
Add a struct iommu_user_data as a bundle of data_ptr/data_len/type from an
iommufd core uAPI structure. Make the user data opaque to the core, since
a userspace driver must match the kernel driver. In the future, if drivers
share some common parameter, there would be a generic parameter as well.
Link: https://lore.kernel.org/r/20231026043938.63898-7-yi.l.liu@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allow iommufd_hwpt_alloc() to have a common routine but jump to different
allocators corresponding to different user input pt_obj types, either an
IOMMUFD_OBJ_IOAS for a PAGING hwpt or an IOMMUFD_OBJ_HWPT_PAGING as the
parent for a NESTED hwpt.
Also, move the "flags" validation to the hwpt allocator (paging), so that
later the hwpt_nested allocator can do its own separate flags validation.
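The dispatch can be sketched as a switch on the type of the object that
pt_id refers to. The enum values and hwpt_alloc_dispatch() below are
hypothetical names for illustration only.

```c
#include <errno.h>

/* Hypothetical object types modelled on the ones named above. */
enum obj_type {
	OBJ_IOAS,
	OBJ_HWPT_PAGING,
	OBJ_HWPT_NESTED,
	OBJ_DEVICE,
};

/*
 * Pick the allocator from the type of the object the user passed in:
 * an IOAS yields a PAGING hwpt, a PAGING hwpt is the parent of a NESTED
 * hwpt, and anything else is rejected.
 */
static int hwpt_alloc_dispatch(enum obj_type pt_type, enum obj_type *created)
{
	switch (pt_type) {
	case OBJ_IOAS:
		*created = OBJ_HWPT_PAGING;
		return 0;
	case OBJ_HWPT_PAGING:
		*created = OBJ_HWPT_NESTED;
		return 0;
	default:
		return -EINVAL;
	}
}
```

Each branch is where the per-type flags validation would live, which is why
the validation moves out of the common routine.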
Link: https://lore.kernel.org/r/20231026043938.63898-6-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To prepare for IOMMUFD_OBJ_HWPT_NESTED, derive struct iommufd_hwpt_paging
from struct iommufd_hw_pagetable, by leaving the common members in struct
iommufd_hw_pagetable. Add __iommufd_object_alloc() and to_hwpt_paging()
helpers for the new structure.
Then, update "hwpt" to "hwpt_paging" throughout the files, accordingly.
Link: https://lore.kernel.org/r/20231026043938.63898-5-yi.l.liu@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Some of the configurations during the attach/replace() should only apply
to IOMMUFD_OBJ_HWPT_PAGING. Once IOMMUFD_OBJ_HWPT_NESTED gets introduced
in a following patch, keeping them unconditionally in the common routine
will not work.
Wrap all of those PAGING-only configurations together into helpers. Do a
hwpt_is_paging check whenever calling them or their fallback routines.
Link: https://lore.kernel.org/r/20231026043938.63898-4-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To add a new IOMMUFD_OBJ_HWPT_NESTED, rename the HWPT object to confine
it to PAGING hwpts/domains. The following patch will separate the hwpt
structure as well.
Link: https://lore.kernel.org/r/20231026043938.63898-3-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
According to the conversation in the following link:
https://lore.kernel.org/linux-iommu/20231020135501.GG3952@nvidia.com/
The enforce_cache_coherency should be set/enforced in the hwpt allocation
routine. The iommu driver in its attach_dev() op should decide whether to
reject or not a device that doesn't match with the configuration of cache
coherency. Drop the enforce_cache_coherency piece in the attach/replace()
and move the remaining "num_devices" piece closer to the refcount that is
using it.
Accordingly drop its function prototype in the header and mark it static.
Also add some extra comments to clarify the expected behaviors.
Link: https://lore.kernel.org/r/20231024012958.30842-1-nicolinc@nvidia.com
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Change test_mock_dirty_bitmaps() to take a flags argument that specifies the
flag under test. The test does the same thing as the regular
GET_DIRTY_BITMAP test, except that it checks that the dirtied bits are
still reported when fetched a second time, as opposed to observing them
cleared.
Link: https://lore.kernel.org/r/20231024135109.73787-19-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Enumerate the capabilities from the mock device and test whether it
advertises as expected. Include it as part of the iommufd_dirty_tracking
fixture.
Link: https://lore.kernel.org/r/20231024135109.73787-18-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a new test ioctl for simulating the dirty IOVAs in the mock domain, and
implement the mock iommu domain ops that get the dirty tracking supported.
The selftest exercises the usual main workflow of:
1) Setting dirty tracking from the iommu domain
2) Read and clear dirty IOPTEs
Different fixtures will test different IOVA range sizes, that exercise
corner cases of the bitmaps.
Link: https://lore.kernel.org/r/20231024135109.73787-17-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Change mock_domain to support dirty tracking and add tests to exercise
the new SET_DIRTY_TRACKING API in the iommufd_dirty_tracking selftest
fixture.
Link: https://lore.kernel.org/r/20231024135109.73787-16-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
To selftest iommu domain dirty tracking enforcement, implement the
necessary support in mock_domain and add a new dev_flags to test that
hwpt_alloc/attach_device fails as expected.
Expand the existing mock_domain fixture with an enforce_dirty test that
exercises the hwpt_alloc and device attachment.
Link: https://lore.kernel.org/r/20231024135109.73787-15-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Expand mock_domain test to be able to manipulate the device capabilities.
This allows testing with a mockdev that does not advertise dirty tracking
support, and thus makes sure the enforce_dirty test behaves as expected.
To avoid breaking the IOMMUFD_TEST UABI, replicate the mock_domain struct
and add an input dev_flags at the end.
Link: https://lore.kernel.org/r/20231024135109.73787-14-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
VFIO has an operation where it unmaps an IOVA while returning a bitmap with
the dirty data. In reality the operation doesn't query the IO pagetables to
find out whether the PTE was dirty or not. Instead it marks everything that
was mapped as dirty, and does so in one syscall.
In IOMMUFD the equivalent is done in two operations by querying with
GET_DIRTY_IOVA followed by UNMAP_IOVA. However, this would incur two TLB
flushes given that after clearing dirty bits IOMMU implementations require
invalidating their IOTLB, plus another invalidation needed for the UNMAP.
To allow dirty bits to be queried faster, add a flag
(IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR) that requests to not clear the dirty
bits from the PTE (but just reading them), under the expectation that the
next operation is the unmap. An alternative is to unmap and just
perpetually mark as dirty, as that's the same behaviour as today. So here
equivalent functionality can be provided with unmap alone, and if real dirty
info is required it will amortize the cost while querying.
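The effect of the flag can be modelled in userspace as below. The flag value
GET_DIRTY_NO_CLEAR and read_dirty() are hypothetical, and the return value
only stands in for "an IOTLB flush would be needed".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GET_DIRTY_NO_CLEAR (1u << 0)	/* hypothetical value for this sketch */

/*
 * Copy the dirty state of each PTE into out[]. Unless NO_CLEAR is set the
 * source bits are cleared. Returns true when something was cleared, which
 * is the case where an IOTLB invalidation would be required.
 */
static bool read_dirty(bool *pte_dirty, bool *out, size_t n, uint32_t flags)
{
	bool cleared = false;
	size_t i;

	for (i = 0; i < n; i++) {
		out[i] = pte_dirty[i];
		if (pte_dirty[i] && !(flags & GET_DIRTY_NO_CLEAR)) {
			pte_dirty[i] = false;
			cleared = true;
		}
	}
	return cleared;
}
```

With NO_CLEAR the PTEs are left untouched, so the only invalidation paid
for is the one the following unmap needs anyway.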
There's still a race against DMA where in theory the unmap of the IOVA
(when the guest invalidates the IOTLB via emulated iommu) would race
against the VF performing DMA on the same IOVA. As discussed in [0], we are
accepting to resolve this race as throwing away the DMA and it doesn't
matter if it hit physical DRAM or not, the VM can't tell if we threw it
away because the DMA was blocked or because we failed to copy the DRAM.
[0] https://lore.kernel.org/linux-iommu/20220502185239.GR8364@nvidia.com/
Link: https://lore.kernel.org/r/20231024135109.73787-10-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Extend IOMMUFD_CMD_GET_HW_INFO op to query generic iommu capabilities for a
given device.
Capabilities are IOMMU agnostic and use device_iommu_capable() API passing
one of the IOMMU_CAP_*. Enumerate IOMMU_CAP_DIRTY_TRACKING for now in the
out_capabilities field returned back to userspace.
Link: https://lore.kernel.org/r/20231024135109.73787-9-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Connect a hw_pagetable to the IOMMU core dirty tracking
read_and_clear_dirty iommu domain op. It exposes all of the functionality
for the UAPI that read the dirtied IOVAs while clearing the Dirty bits from
the PTEs.
In doing so, add an IO pagetable API iopt_read_and_clear_dirty_data() that
reads the dirty IOPTEs for a given IOVA range and then copies them back to
the userspace bitmap.
Underneath it uses the IOMMU domain kernel API which will read the dirty
bits, as well as atomically clearing the IOPTE dirty bit and flushing the
IOTLB at the end. The IOVA bitmaps usage takes care of the iteration of the
bitmaps user pages efficiently and without copies. Within the iterator
function we iterate over io-pagetable contiguous areas that have been
mapped.
Contrary to past incarnations of a similar interface in VFIO, the IOVA range
to be scanned is tied to the bitmap size, thus the application needs to
pass an appropriately sized bitmap address taking into account the iova
range being passed *and* page size ... as opposed to allowing bitmap-iova
!= iova.
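As a worked example of the sizing, one bit is needed per page of the IOVA
range. The helper names are hypothetical, and rounding the byte count up to
whole 64-bit words is an assumption of this sketch rather than something
stated above.

```c
#include <stdint.h>

/*
 * One bit per page of the IOVA range. Assumes page_size is non-zero and
 * length is a multiple of page_size.
 */
static uint64_t dirty_bitmap_bits(uint64_t length, uint64_t page_size)
{
	return length / page_size;
}

/* Bytes to allocate, rounded up to whole 64-bit words (assumed). */
static uint64_t dirty_bitmap_bytes(uint64_t length, uint64_t page_size)
{
	uint64_t bits = dirty_bitmap_bits(length, page_size);

	return ((bits + 63) / 64) * 8;
}
```

For a 2 MiB range with 4 KiB pages this gives 512 bits, i.e. a 64 byte
bitmap.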
Link: https://lore.kernel.org/r/20231024135109.73787-8-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Every IOMMU driver should be able to implement the needed iommu domain ops
to control dirty tracking.
Connect a hw_pagetable to the IOMMU core dirty tracking ops, specifically
the ability to enable/disable dirty tracking on an IOMMU domain
(hw_pagetable id). To that end add an io_pagetable kernel API to toggle
dirty tracking:
* iopt_set_dirty_tracking(iopt, [domain], state)
The intended caller of this is via the hw_pagetable object that is created.
Internally it will ensure the leftover dirty state is cleared /right
before/ dirty tracking starts. This is also useful for iommu drivers which
may decide that dirty tracking is always-enabled at boot without wanting to
toggle dynamically via corresponding iommu domain op.
Link: https://lore.kernel.org/r/20231024135109.73787-7-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Throughout the lifetime of an IOMMU domain that wants to use dirty tracking,
some guarantees are needed such that any device attached to the iommu_domain
supports dirty tracking.
The idea is to handle a case where the IOMMUs in the system are asymmetric
feature-wise and thus the capability may not be supported for all devices.
The enforcement is done by adding a flag into HWPT_ALLOC, namely:
IOMMU_HWPT_ALLOC_DIRTY_TRACKING
passed in the HWPT_ALLOC ioctl() flags. The enforcement is done by creating
an iommu_domain via domain_alloc_user() and validating the requested flags
against what the device IOMMU advertises as supported (failing accordingly).
Advertising the new IOMMU domain feature flag requires that the individual
iommu driver capability is supported when a future device attachment
happens.
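The two enforcement points, allocation and later attachment, can be
sketched as follows. All names, the flag value and the error codes are
assumptions of this model and are not taken from the kernel sources.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define ALLOC_DIRTY_TRACKING (1u << 1)	/* hypothetical flag value */

struct device {
	bool dirty_capable;	/* does the device's IOMMU advertise it? */
};

struct domain {
	bool dirty_enforced;	/* was the flag requested at allocation? */
};

/* Validate the requested flags against what the device's IOMMU supports. */
static int domain_alloc_user(const struct device *dev, uint32_t flags,
			     struct domain *out)
{
	if (flags & ~ALLOC_DIRTY_TRACKING)
		return -EOPNOTSUPP;
	if ((flags & ALLOC_DIRTY_TRACKING) && !dev->dirty_capable)
		return -EOPNOTSUPP;

	out->dirty_enforced = (flags & ALLOC_DIRTY_TRACKING) != 0;
	return 0;
}

/* A later attachment must not weaken what the domain promised. */
static int attach_device(const struct domain *d, const struct device *dev)
{
	if (d->dirty_enforced && !dev->dirty_capable)
		return -EINVAL;
	return 0;
}
```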
Link: https://lore.kernel.org/r/20231024135109.73787-6-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Have the IOVA bitmap exported symbols adhere to the IOMMUFD symbol
export convention i.e. using the IOMMUFD namespace. In doing so,
import the namespace in the current users. This means VFIO and the
vfio-pci drivers that use iova_bitmap_set().
Link: https://lore.kernel.org/r/20231024135109.73787-4-joao.m.martins@oracle.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Both VFIO and IOMMUFD will need iova bitmap for storing dirties and walking
the user bitmaps, so move the common dependency into IOMMUFD. In doing
so, create the symbol IOMMUFD_DRIVER which designates the builtin code that
will be used by drivers when selected. Today this means MLX5_VFIO_PCI and
PDS_VFIO_PCI. IOMMU drivers will do the same (in future patches) when
supporting dirty tracking and select IOMMUFD_DRIVER accordingly.
Given that the symbol may be disabled, add header definitions in
iova_bitmap.h for when IOMMUFD_DRIVER=n.
Link: https://lore.kernel.org/r/20231024135109.73787-3-joao.m.martins@oracle.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add mock_domain_alloc_user() and a new test case for
IOMMU_HWPT_ALLOC_NEST_PARENT.
Link: https://lore.kernel.org/r/20230928071528.26258-6-yi.l.liu@intel.com
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Extend IOMMU_HWPT_ALLOC to allocate domains to be used as parent (stage-2)
in nested translation.
Add IOMMU_HWPT_ALLOC_NEST_PARENT to the uAPI.
Link: https://lore.kernel.org/r/20230928071528.26258-5-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Extend iommufd_hw_pagetable_alloc() to accept user flags; the uAPI will
provide the flags.
Link: https://lore.kernel.org/r/20230928071528.26258-4-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Make IOMMUFD use iommu_domain_alloc_user() by default for iommu_domain
creation. IOMMUFD needs to support iommu_domain allocation with parameters
from userspace in nested support, and a driver is expected to implement
everything under this op.
If the iommu driver doesn't provide domain_alloc_user callback then
IOMMUFD falls back to use iommu_domain_alloc() with an UNMANAGED type if
possible.
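The fallback can be sketched with optional function pointers. Everything
below is a hypothetical userspace model: the struct layouts, the enum
values and the choice to refuse non-zero flags on the legacy path (one
reading of "if possible") are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

enum { DOMAIN_UNMANAGED = 1, DOMAIN_USER = 2 };	/* hypothetical values */

struct domain {
	int type;
	uint32_t flags;
};

struct driver_ops {
	struct domain *(*domain_alloc_user)(uint32_t flags);	/* optional */
	struct domain *(*domain_alloc)(int type);
};

/* Toy driver callbacks backed by static storage, for demonstration only. */
static struct domain legacy_dom, user_dom;

static struct domain *legacy_alloc(int type)
{
	legacy_dom.type = type;
	legacy_dom.flags = 0;
	return &legacy_dom;
}

static struct domain *user_alloc(uint32_t flags)
{
	user_dom.type = DOMAIN_USER;
	user_dom.flags = flags;
	return &user_dom;
}

/* Prefer the user-aware op and fall back to the legacy allocator. */
static struct domain *hwpt_domain_alloc(const struct driver_ops *ops,
					uint32_t flags)
{
	if (ops->domain_alloc_user)
		return ops->domain_alloc_user(flags);
	if (flags)	/* assumed: the legacy path cannot honour user flags */
		return NULL;
	return ops->domain_alloc(DOMAIN_UNMANAGED);
}
```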
Link: https://lore.kernel.org/r/20230928071528.26258-3-yi.l.liu@intel.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This is used when the iommu driver is taking control of the dma_ops,
currently only on S390 and power spapr. It is designed to preserve the
original ops->detach_dev() semantic that these platforms were built around.
Provide an opaque domain type and a 'default_domain' ops value that allows
the driver to trivially force any single domain as the default domain.
Update iommufd selftest to use this instead of set_platform_dma_ops.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v8-81230027b2fa+9d-iommu_all_defdom_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This allows a driver to set a global static to an IDENTITY domain and
the core code will automatically use it whenever an IDENTITY domain
is requested.
By making it always available it means the IDENTITY can be used in error
handling paths to force the iommu driver into a known state. Devices
implementing global static identity domains should avoid failing their
attach_dev ops.
To make global static domains simpler allow drivers to omit their free
function and update the iommufd selftest.
Convert rockchip to use the new mechanism.
Tested-by: Steven Price <steven.price@arm.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v8-81230027b2fa+9d-iommu_all_defdom_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
This includes a shared branch with VFIO:
- Enhance VFIO_DEVICE_GET_PCI_HOT_RESET_INFO so it can work with iommufd
FDs, not just group FDs. This removes the last place in the uAPI that
required the group fd.
- Give VFIO a new device node /dev/vfio/devices/vfioX (the so called cdev
node) which is very similar to the FD from VFIO_GROUP_GET_DEVICE_FD.
The cdev is associated with the struct device that the VFIO driver is
bound to and shows up in sysfs in the normal way.
- Add a cdev IOCTL VFIO_DEVICE_BIND_IOMMUFD which allows a newly opened
/dev/vfio/devices/vfioX to be associated with an IOMMUFD, this replaces
the VFIO_GROUP_SET_CONTAINER flow.
- Add cdev IOCTLs VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT to allow the IOMMU
translation the vfio_device is associated with to be changed. This is a
significant new feature for VFIO as previously each vfio_device was
fixed to a single translation.
The translation is under the control of iommufd, so it can be any of
the different translation modes that iommufd is learning to create.
At this point VFIO has compilation options to remove the legacy interfaces
and in modern mode it behaves like a normal driver subsystem. The
/dev/vfio/iommu and /dev/vfio/groupX nodes are not present and each
vfio_device only has a /dev/vfio/devices/vfioX cdev node that represents
the device.
On top of this is built some of the new iommufd functionality:
- IOMMU_HWPT_ALLOC allows userspace to directly create the low level
IO Page table objects and affiliate them with IOAS objects that hold
the translation mapping. This is the basic functionality for the
normal IOMMU_DOMAIN_PAGING domains.
- VFIO_DEVICE_ATTACH_IOMMUFD_PT can be used to replace the current
translation. This is wired up through all the layers down to the
driver so the driver has the ability to implement a hitless
replacement. This is necessary to fully support guest behaviors when
emulating HW (eg guest atomic change of translation)
- IOMMU_GET_HW_INFO returns information about the IOMMU driver HW that
owns a VFIO device. This includes support for the Intel iommu, and
patches have been posted for all the other server IOMMU.
Along the way are a number of internal items:
- New iommufd kapis iommufd_ctx_has_group(), iommufd_device_to_ictx(),
iommufd_device_to_id(), iommufd_access_detach(), iommufd_ctx_from_fd(),
iommufd_device_replace()
- iommufd now internally tracks iommu_groups as it needs some per-group
data
- Reorganize how the internal hwpt allocation flows to have more robust
locking
- Improve the access interfaces to support detach and replace of an IOAS
from an access
- New selftests and a rework of how the selftests create a mock iommu
driver to be more like a real iommu driver
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"On top of the vfio updates is built some new iommufd functionality:
- IOMMU_HWPT_ALLOC allows userspace to directly create the low level
IO Page table objects and affiliate them with IOAS objects that
hold the translation mapping. This is the basic functionality for
the normal IOMMU_DOMAIN_PAGING domains.
- VFIO_DEVICE_ATTACH_IOMMUFD_PT can be used to replace the current
translation. This is wired up through all the layers down to the
driver so the driver has the ability to implement a hitless
replacement. This is necessary to fully support guest behaviors
when emulating HW (eg guest atomic change of translation)
- IOMMU_GET_HW_INFO returns information about the IOMMU driver HW
that owns a VFIO device. This includes support for the Intel iommu,
and patches have been posted for all the other server IOMMU.
Along the way are a number of internal items:
- New iommufd kernel APIs: iommufd_ctx_has_group(),
iommufd_device_to_ictx(), iommufd_device_to_id(),
iommufd_access_detach(), iommufd_ctx_from_fd(),
iommufd_device_replace()
- iommufd now internally tracks iommu_groups as it needs some
per-group data
- Reorganize how the internal hwpt allocation flows to have more
robust locking
- Improve the access interfaces to support detach and replace of an
IOAS from an access
- New selftests and a rework of how the selftests create a mock
iommu driver to be more like a real iommu driver"
Link: https://lore.kernel.org/lkml/ZO%2FTe6LU1ENf58ZW@nvidia.com/
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (34 commits)
iommufd/selftest: Don't leak the platform device memory when unloading the module
iommu/vt-d: Implement hw_info for iommu capability query
iommufd/selftest: Add coverage for IOMMU_GET_HW_INFO ioctl
iommufd: Add IOMMU_GET_HW_INFO
iommu: Add new iommu op to get iommu hardware information
iommu: Move dev_iommu_ops() to private header
iommufd: Remove iommufd_ref_to_users()
iommufd/selftest: Make the mock iommu driver into a real driver
vfio: Support IO page table replacement
iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_REPLACE_IOAS coverage
iommufd: Add iommufd_access_replace() API
iommufd: Use iommufd_access_change_ioas in iommufd_access_destroy_object
iommufd: Add iommufd_access_change_ioas(_id) helpers
iommufd: Allow passing in iopt_access_list_id to iopt_remove_access()
vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()
iommufd/selftest: Add a selftest for IOMMU_HWPT_ALLOC
iommufd/selftest: Return the real idev id from selftest mock_domain
iommufd: Add IOMMU_HWPT_ALLOC
iommufd/selftest: Test iommufd_device_replace()
iommufd: Make destroy_rwsem use a lock class per object type
...
- VFIO direct character device (cdev) interface support. This extracts
the vfio device fd from the container and group model, and is intended
to be the native uAPI for use with IOMMUFD. (Yi Liu)
- Enhancements to the PCI hot reset interface in support of cdev usage.
(Yi Liu)
- Fix a potential race between registering and unregistering vfio files
in the kvm-vfio interface and extend use of a lock to avoid extra
drop and acquires. (Dmitry Torokhov)
- A new vfio-pci variant driver for the AMD/Pensando Distributed Services
Card (PDS) Ethernet device, supporting live migration. (Brett Creeley)
- Cleanups to remove redundant owner setup in cdx and fsl bus drivers,
and simplify driver init/exit in fsl code. (Li Zetao)
- Fix uninitialized hole in data structure and pad capability structures
for alignment. (Stefan Hajnoczi)
Merge tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- VFIO direct character device (cdev) interface support. This extracts
the vfio device fd from the container and group model, and is
intended to be the native uAPI for use with IOMMUFD (Yi Liu)
- Enhancements to the PCI hot reset interface in support of cdev usage
(Yi Liu)
- Fix a potential race between registering and unregistering vfio files
in the kvm-vfio interface and extend use of a lock to avoid extra
drop and acquires (Dmitry Torokhov)
- A new vfio-pci variant driver for the AMD/Pensando Distributed
Services Card (PDS) Ethernet device, supporting live migration (Brett
Creeley)
- Cleanups to remove redundant owner setup in cdx and fsl bus drivers,
and simplify driver init/exit in fsl code (Li Zetao)
- Fix uninitialized hole in data structure and pad capability
structures for alignment (Stefan Hajnoczi)
* tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio: (53 commits)
vfio/pds: Send type for SUSPEND_STATUS command
vfio/pds: fix return value in pds_vfio_get_lm_file()
pds_core: Fix function header descriptions
vfio: align capability structures
vfio/type1: fix cap_migration information leak
vfio/fsl-mc: Use module_fsl_mc_driver macro to simplify the code
vfio/cdx: Remove redundant initialization owner in vfio_cdx_driver
vfio/pds: Add Kconfig and documentation
vfio/pds: Add support for firmware recovery
vfio/pds: Add support for dirty page tracking
vfio/pds: Add VFIO live migration support
vfio/pds: register with the pds_core PF
pds_core: Require callers of register/unregister to pass PF drvdata
vfio/pds: Initial support for pds VFIO driver
vfio: Commonize combine_ranges for use in other VFIO drivers
kvm/vfio: avoid bouncing the mutex when adding and deleting groups
kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add()
docs: vfio: Add vfio device cdev description
vfio: Compile vfio_group infrastructure optionally
vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev()
...
It should call platform_device_unregister() instead of
platform_device_del() to unregister and free the device.
Fixes: 23a1b46f15 ("iommufd/selftest: Make the mock iommu driver into a real driver")
Link: https://lore.kernel.org/r/20230816081318.1232865-1-yangyingliang@huawei.com
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a mock_domain_hw_info function and an iommu_test_hw_info data
structure. This allows testing the IOMMU_GET_HW_INFO ioctl by passing the
test_reg value for the mock_dev.
Link: https://lore.kernel.org/r/20230818101033.4100-5-yi.l.liu@intel.com
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Under nested IOMMU translation, userspace owns the stage-1 translation
table (e.g. the stage-1 page table of Intel VT-d or the context table of
ARM SMMUv3, etc.). Stage-1 translation tables are vendor specific and
need to be compatible with the underlying IOMMU hardware. Hence, userspace
should know the IOMMU hardware capability before creating the stage-1
translation table and configuring it to the kernel.
This adds IOMMU_GET_HW_INFO ioctl to query the IOMMU hardware information
(a.k.a. capability) for a given device. The returned data is vendor
specific; userspace needs to decode it using the structure indicated by
the output @out_data_type field.
As only physical devices have IOMMU hardware, this will return an error if
the given device is not a physical device.
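Decoding by type tag on the userspace side might look like the sketch
below. Every name in it (enum mock_hw_info_type, struct
mock_hw_info_vendor_a, mock_decode_hw_info()) is hypothetical and does not
reproduce the real uAPI structures; only the idea of switching on the
returned data type comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type tags; the real uAPI defines its own enum. */
enum mock_hw_info_type {
	MOCK_HW_INFO_NONE = 0,
	MOCK_HW_INFO_VENDOR_A = 1,
};

/* Hypothetical vendor payload layout. */
struct mock_hw_info_vendor_a {
	uint32_t flags;
	uint32_t reserved;
	uint64_t cap_reg;
};

/*
 * Pick the structure that matches out_data_type and copy the payload into
 * it. Returns 0 and stores the capability register on success, -1 when the
 * type is unknown or the buffer is too short for that type.
 */
static int mock_decode_hw_info(uint32_t out_data_type, const void *data,
			       size_t data_len, uint64_t *cap_reg)
{
	struct mock_hw_info_vendor_a info;

	switch (out_data_type) {
	case MOCK_HW_INFO_VENDOR_A:
		if (data_len < sizeof(info))
			return -1;
		/* memcpy avoids assuming the caller's buffer is aligned. */
		memcpy(&info, data, sizeof(info));
		*cap_reg = info.cap_reg;
		return 0;
	default:
		/* No vendor data, or a type this program does not know. */
		return -1;
	}
}
```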
Link: https://lore.kernel.org/r/20230818101033.4100-4-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The VFIO_DEVICE_GET_INFO, VFIO_DEVICE_GET_REGION_INFO, and
VFIO_IOMMU_GET_INFO ioctls fill in an info struct followed by capability
structs:
+------+---------+---------+-----+
| info | caps[0] | caps[1] | ... |
+------+---------+---------+-----+
Both the info and capability struct sizes are not always multiples of
sizeof(u64), leaving u64 fields in later capability structs misaligned.
Userspace applications currently need to handle misalignment manually in
order to support CPU architectures and programming languages with strict
alignment requirements.
Make life easier for userspace by ensuring alignment in the kernel. This
is done by padding info struct definitions and by copying out zeroes
after capability structs that are not aligned.
The new layout is as follows:
+------+---------+---+---------+-----+
| info | caps[0] | 0 | caps[1] | ... |
+------+---------+---+---------+-----+
In this example caps[0] has a size that is not a multiple of sizeof(u64),
so zero padding is added to align the subsequent structure.
Adding zero padding between structs does not break the uapi. The memory
layout is specified by the info.cap_offset and caps[i].next fields
filled in by the kernel. Applications use these field values to locate
structs and are therefore unaffected by the addition of zero padding.
Note that code that copies out info structs with padding is updated to
always zero the struct and copy out as many bytes as userspace
requested. This makes the code shorter and avoids potential information
leaks by ensuring padding is initialized.
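The padding rule and the offset-driven walk can be sketched in plain C as
follows. This is an illustration only: struct cap_header is a stand-in
modelled on a capability header, and align_u64(), cap_append() and
cap_count() are hypothetical helpers, not the kernel's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in capability header; next is an offset from the buffer start. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;   /* 0 terminates the chain */
};

/* Round size up to the next multiple of sizeof(uint64_t). */
static size_t align_u64(size_t size)
{
	return (size + sizeof(uint64_t) - 1) & ~(sizeof(uint64_t) - 1);
}

/*
 * Place a capability of len bytes (header included) at *used and zero its
 * body plus tail padding, so the following capability starts u64-aligned.
 * prev is the offset of the previous capability, or 0 for the first one.
 * Returns the offset used, or 0 if the capability does not fit.
 */
static size_t cap_append(uint8_t *buf, size_t bufsz, size_t *used,
			 size_t prev, uint16_t id, size_t len)
{
	struct cap_header hdr = { .id = id, .version = 1, .next = 0 };
	size_t off = *used;
	size_t padded = align_u64(len);

	assert(len >= sizeof(hdr));
	assert(off <= bufsz && off % sizeof(uint64_t) == 0);
	if (padded > bufsz - off)
		return 0;

	memset(buf + off, 0, padded);
	memcpy(buf + off, &hdr, sizeof(hdr));
	if (prev) {
		/* Link the previous capability to this one. */
		uint32_t next = (uint32_t)off;

		memcpy(buf + prev + offsetof(struct cap_header, next),
		       &next, sizeof(next));
	}
	*used = off + padded;
	return off;
}

/* Count capabilities by following the stored offsets only. */
static unsigned int cap_count(const uint8_t *buf, size_t bufsz, size_t first)
{
	unsigned int n = 0;
	size_t off = first;

	while (off && off + sizeof(struct cap_header) <= bufsz) {
		struct cap_header hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		n++;
		if (hdr.next && hdr.next <= off)
			break;   /* malformed chain: refuse to loop */
		off = hdr.next;
	}
	return n;
}
```

Because the walker never assumes that one capability ends where the next
begins, the inserted padding is invisible to it.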
Originally-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20230809203144.2880050-1-stefanha@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This no longer has any callers, so remove the function.
Kevin noticed that after commit 99f98a7c0d ("iommufd: IOMMUFD_DESTROY
should not increase the refcount") there was only one other user and it
turns out the rework in commit 9227da7816 ("iommufd: Add
iommufd_access_change_ioas(_id) helpers") got rid of the last one.
Link: https://lore.kernel.org/r/0-v1-abb31bedd888+c1-iommufd_ref_to_users_jgg@nvidia.com
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Merge tag 'v6.5-rc6' into iommufd for-next
Required for following patches.
Resolve merge conflict by using the hunk from the for-next branch and
shifting the iommufd_object_deref_user() into iommufd_hw_pagetable_put()
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
I've avoided doing this because there is no way to make this happen
without an intrusion into the core code. Up till now this has avoided
needing the core code's probe path with some hackery - but now that
default domains are becoming mandatory it is unavoidable.
This became a serious problem when the core code stopped allowing
partially registered iommu drivers in commit 14891af379 ("iommu: Move
the iommu driver sysfs setup into iommu_init/deinit_device()") which
breaks the selftest. That series was developed along with a second series
that contained this patch so it was not noticed.
Make it so that iommufd selftest can create a real iommu driver and bind
it only to its own private bus. Add iommu_device_register_bus() as a core
code helper to make this possible. It simply sets the right pointers and
registers the notifier block. The mock driver then works like any normal
driver should, with probe triggered by the bus ops.
When the bus->iommu_ops stuff is fully unwound we can probably do better
here and remove this special case.
Link: https://lore.kernel.org/r/15-v6-e8114faedade+425-iommu_all_defdom_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Add a new IOMMU_TEST_OP_ACCESS_REPLACE_IOAS to allow replacing the
access->ioas, corresponding to the iommufd_access_replace() helper.
Then add replace coverage as a part of user_copy test case, which
basically repeats the copy test after replacing the old ioas with a new
one.
Link: https://lore.kernel.org/r/a4897f93d41c34b972213243b8dbf4c3832842e4.1690523699.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Taking advantage of the new iommufd_access_change_ioas_id helper, add an
iommufd_access_replace() API for the VFIO emulated pathway to use.
Link: https://lore.kernel.org/r/a3267b924fd5f45e0d3a1dd13a9237e923563862.1690523699.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Update iommufd_access_destroy_object() to call the new
iommufd_access_change_ioas() helper.
It is impossible to legitimately race iommufd_access_destroy_object() with
iommufd_access_change_ioas() as iommufd_access_destroy_object() is only
called once the refcount reaches zero, so any concurrent
iommufd_access_change_ioas() is already UAFing the memory.
Link: https://lore.kernel.org/r/f9fbeca2cde7f8515da18d689b3e02a6a40a5e14.1690523699.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The complication of the mutex and refcount will be amplified after we
introduce the replace support for accesses. So, add a preparatory change
of a constitutive helper iommufd_access_change_ioas() and its wrapper
iommufd_access_change_ioas_id(). They can simply take care of existing
iommufd_access_attach() and iommufd_access_detach(), properly sequencing
the refcount puts so that they are truly at the end of the sequence after
we know the IOAS pointer is not required any more.
Link: https://lore.kernel.org/r/da0c462532193b447329c4eb975a596f47e49b70.1690523699.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This is a preparatory change for ioas replacement support for accesses.
The replacement routine does an iopt_add_access() for a new IOAS first and
then iopt_remove_access() for the old IOAS upon the success of the first
call. However, the first call overrides the iopt_access_list_id in the
access struct, resulting in iopt_remove_access() being unable to work on
the old IOAS.
Add an iopt_access_list_id as a parameter to iopt_remove_access, so the
replacement routine can save the id before it gets overwritten. Pass the
id in iopt_remove_access() for a proper cleanup.
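The ordering problem and the fix can be modelled in a few lines of C. This
is a sketch under stated assumptions: struct mock_ioas, struct mock_access
and the mock_*() helpers are hypothetical and only imitate the shape of
iopt_add_access() and iopt_remove_access().

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_ACCESS 4

struct mock_access;

/* Hypothetical IOAS with a tiny id-indexed access list. */
struct mock_ioas {
	struct mock_access *access_list[MOCK_MAX_ACCESS];
};

struct mock_access {
	struct mock_ioas *ioas;
	unsigned int list_id;   /* id of this access within ioas->access_list */
};

/* Register the access in ioas; overwrites access->list_id on success. */
static int mock_add_access(struct mock_ioas *ioas, struct mock_access *access)
{
	for (unsigned int id = 0; id < MOCK_MAX_ACCESS; id++) {
		if (!ioas->access_list[id]) {
			ioas->access_list[id] = access;
			access->list_id = id;
			return 0;
		}
	}
	return -1;
}

/* The id is passed explicitly so a stale access->list_id cannot mislead. */
static void mock_remove_access(struct mock_ioas *ioas,
			       struct mock_access *access, unsigned int id)
{
	assert(id < MOCK_MAX_ACCESS && ioas->access_list[id] == access);
	ioas->access_list[id] = NULL;
}

/* Add to the new IOAS first, then remove from the old one by saved id. */
static int mock_replace_ioas(struct mock_access *access,
			     struct mock_ioas *new_ioas)
{
	struct mock_ioas *old_ioas = access->ioas;
	unsigned int old_id = access->list_id;   /* save before overwrite */

	if (mock_add_access(new_ioas, access))
		return -1;   /* old attachment is left untouched */
	if (old_ioas)
		mock_remove_access(old_ioas, access, old_id);
	access->ioas = new_ioas;
	return 0;
}
```

Had mock_remove_access() read access->list_id itself, it would have
looked up the id assigned by the new IOAS and removed the wrong slot.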
The existing callers should just pass in access->iopt_access_list_id.
Link: https://lore.kernel.org/r/7bb939b9e0102da0c099572bb3de78ab7622221e.1690523699.git.nicolinc@nvidia.com
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
syzkaller found a race where IOMMUFD_DESTROY increments the refcount:
	obj = iommufd_get_object(ucmd->ictx, cmd->id, IOMMUFD_OBJ_ANY);
	if (IS_ERR(obj))
		return PTR_ERR(obj);
	iommufd_ref_to_users(obj);
	/* See iommufd_ref_to_users() */
	if (!iommufd_object_destroy_user(ucmd->ictx, obj))
As part of the sequence to join the two existing primitives together.
Allowing the refcount to be elevated without holding the destroy_rwsem
violates the assumption that all temporary refcount elevations are
protected by destroy_rwsem. Racing IOMMUFD_DESTROY with
iommufd_object_destroy_user() will cause spurious failures:
WARNING: CPU: 0 PID: 3076 at drivers/iommu/iommufd/device.c:477 iommufd_access_destroy+0x18/0x20 drivers/iommu/iommufd/device.c:478
Modules linked in:
CPU: 0 PID: 3076 Comm: syz-executor.0 Not tainted 6.3.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
RIP: 0010:iommufd_access_destroy+0x18/0x20 drivers/iommu/iommufd/device.c:477
Code: e8 3d 4e 00 00 84 c0 74 01 c3 0f 0b c3 0f 1f 44 00 00 f3 0f 1e fa 48 89 fe 48 8b bf a8 00 00 00 e8 1d 4e 00 00 84 c0 74 01 c3 <0f> 0b c3 0f 1f 44 00 00 41 57 41 56 41 55 4c 8d ae d0 00 00 00 41
RSP: 0018:ffffc90003067e08 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888109ea0300 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000ffffffff
RBP: 0000000000000004 R08: 0000000000000000 R09: ffff88810bbb3500
R10: ffff88810bbb3e48 R11: 0000000000000000 R12: ffffc90003067e88
R13: ffffc90003067ea8 R14: ffff888101249800 R15: 00000000fffffffe
FS: 00007ff7254fe6c0(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555557262da8 CR3: 000000010a6fd000 CR4: 0000000000350ef0
Call Trace:
<TASK>
iommufd_test_create_access drivers/iommu/iommufd/selftest.c:596 [inline]
iommufd_test+0x71c/0xcf0 drivers/iommu/iommufd/selftest.c:813
iommufd_fops_ioctl+0x10f/0x1b0 drivers/iommu/iommufd/main.c:337
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x84/0xc0 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x80 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The solution is to not increment the refcount on the IOMMUFD_DESTROY path
at all. Instead use the xa_lock to serialize everything. The refcount
check == 1 and xa_erase can be done under a single critical region. This
avoids the need for any refcount incrementing.
It has the downside that if userspace races destroy with other operations
it will get an EBUSY instead of waiting, but this kind of racing is
already dangerous.
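The "check and erase in one critical region" idea can be sketched in
userspace C as below. The names are hypothetical: struct mock_table stands
in for the object xarray, an atomic_flag spinlock stands in for xa_lock,
and the sketch assumes every path that takes a reference also holds that
lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define MOCK_MAX_OBJS 8

struct mock_obj {
	unsigned int users;   /* 1 == only the table's own reference remains */
};

struct mock_table {
	atomic_flag lock;     /* stands in for xa_lock */
	struct mock_obj *slots[MOCK_MAX_OBJS];
};

static void mock_lock(struct mock_table *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock,
						 memory_order_acquire))
		;   /* spin until the holder clears the flag */
}

static void mock_unlock(struct mock_table *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/*
 * Test "users == 1" and erase the slot inside a single critical region,
 * never raising the count. On success the removed object is returned via
 * *out so the caller can free it outside the lock.
 */
static int mock_destroy(struct mock_table *t, unsigned int id,
			struct mock_obj **out)
{
	struct mock_obj *obj;
	int ret = 0;

	mock_lock(t);
	obj = id < MOCK_MAX_OBJS ? t->slots[id] : NULL;
	if (!obj)
		ret = -ENOENT;
	else if (obj->users != 1)
		ret = -EBUSY;   /* another user holds it: fail, do not wait */
	else
		t->slots[id] = NULL;
	mock_unlock(t);

	if (!ret)
		*out = obj;
	return ret;
}
```

A caller that loses the race simply sees -EBUSY and the object stays in
the table with its count unchanged.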
Fixes: 2ff4bed7fe ("iommufd: File descriptor, context, kconfig and makefiles")
Link: https://lore.kernel.org/r/2-v1-85aacb2af554+bc-iommufd_syz3_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reported-by: syzbot+7574ebfe589049630608@syzkaller.appspotmail.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Now that we actually call iommufd_device_bind() we can return the
idev_id from that function to userspace for use in other APIs.
Link: https://lore.kernel.org/r/18-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This allows userspace to manually create HWPTs on IOAS's and then use
those HWPTs as inputs to iommufd_device_attach/replace().
A following series will extend this to allow creating iommu_domains with
driver specific parameters.
Link: https://lore.kernel.org/r/17-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Allow the selftest to call the function on the mock idev, add some tests
to exercise it.
Link: https://lore.kernel.org/r/16-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The selftest invokes things like replace under the object lock of its
idev which protects the idev in a similar way to a real user.
Unfortunately this triggers lockdep. A lock class per type will solve the
problem.
Link: https://lore.kernel.org/r/15-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Replace allows all the devices in a group to move in one step to a new
HWPT. Further, the HWPT move is done without going through a blocking
domain so that the IOMMU driver can implement some level of
non-disruption to ongoing DMA if that has meaning for it (e.g. for future
special driver domains).
Replace uses a lot of the same logic as normal attach, except the actual
domain change over has different restrictions, and we are careful to
sequence things so that failure is going to leave everything the way it
was, and not get trapped in a blocking domain or something if there is
ENOMEM.
Link: https://lore.kernel.org/r/14-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The code flow for first time attaching a PT and replacing a PT is very
similar except for the lowest do_attach step.
Reorganize this so that the do_attach step is a function pointer.
Replace requires destroying the old HWPT once it is replaced. This
destruction cannot be done under all the locks that are held in the
function pointer, so the signature allows returning a HWPT which will be
destroyed by the caller after everything is unlocked.
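A compact model of that shape is shown below. It is a sketch only: struct
mock_dev, struct mock_hwpt, the attach_fn type and change_pt() are
hypothetical names, and a single "locked" flag stands in for all the locks
held around the function pointer call.

```c
#include <assert.h>
#include <stddef.h>

struct mock_hwpt {
	int destroyed;
};

struct mock_dev {
	struct mock_hwpt *hwpt;   /* currently attached page table, or NULL */
	int locked;               /* models the locks held during attach */
};

/* Returns a HWPT the caller must destroy after unlocking, or NULL. */
typedef struct mock_hwpt *(*attach_fn)(struct mock_dev *dev,
				       struct mock_hwpt *hwpt);

static struct mock_hwpt *do_attach(struct mock_dev *dev,
				   struct mock_hwpt *hwpt)
{
	dev->hwpt = hwpt;
	return NULL;   /* first attach: nothing to clean up */
}

static struct mock_hwpt *do_replace(struct mock_dev *dev,
				    struct mock_hwpt *hwpt)
{
	struct mock_hwpt *old = dev->hwpt;

	dev->hwpt = hwpt;
	return old;    /* hand the old HWPT back instead of freeing here */
}

static void mock_hwpt_destroy(struct mock_dev *dev, struct mock_hwpt *hwpt)
{
	assert(!dev->locked);   /* destruction must not run under the locks */
	hwpt->destroyed = 1;
}

/* Shared flow: only the innermost step differs between attach and replace. */
static void change_pt(struct mock_dev *dev, struct mock_hwpt *hwpt,
		      attach_fn fn)
{
	struct mock_hwpt *to_destroy;

	dev->locked = 1;
	to_destroy = fn(dev, hwpt);
	dev->locked = 0;

	if (to_destroy)
		mock_hwpt_destroy(dev, to_destroy);
}
```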
Link: https://lore.kernel.org/r/12-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Due to the auto_domains mechanism the ioas->mutex must be held until
the hwpt is completely setup by iommufd_object_abort_and_destroy() or
iommufd_object_finalize().
This prevents a concurrent iommufd_device_auto_get_domain() from seeing
an incompletely initialized object through the ioas->hwpt_list.
To make this more consistent move the unlock until after finalize.
Fixes: e8d5721003 ("iommufd: Add kAPI toward external drivers for physical devices")
Link: https://lore.kernel.org/r/11-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
During creation the hwpt must have the ioas->mutex held until the object
is finalized. This means we need to be able to call
iommufd_object_abort_and_destroy() while holding the mutex.
Since iommufd_hw_pagetable_destroy() also needs the mutex this is
problematic.
Fix it by creating a special abort op for the object that can assume the
caller is holding the lock, as required by the contract.
The next patch will add another iommufd_object_abort_and_destroy() for a
hwpt.
Fixes: e8d5721003 ("iommufd: Add kAPI toward external drivers for physical devices")
Link: https://lore.kernel.org/r/10-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Logically the HWPT should have the coherency set properly for the device
that it is being created for when it is created.
This was happening implicitly if the immediate_attach was set because
iommufd_hw_pagetable_attach() does it as the first thing.
Do it unconditionally so !immediate_attach works properly.
Link: https://lore.kernel.org/r/9-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The next patch will need to call this from two places.
Link: https://lore.kernel.org/r/8-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The sw_msi_start is only set by the ARM drivers and it is always constant.
Due to the way vfio/iommufd allow domains to be re-used between
devices we have a built in assumption that there is only one value
for sw_msi_start and it is global to the system.
To make replace simpler, where we may not reparse the
iommu_get_resv_regions(), move the sw_msi_start to the iommufd_group so it
is always available once any HWPT has been attached.
Link: https://lore.kernel.org/r/7-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This only needs to be done once per group, not once per device. The once
per device was a way to make the device list work. Since we are abandoning
this we can optimize things a bit.
Link: https://lore.kernel.org/r/6-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The driver facing API in the iommu core makes the reserved regions
per-device. An algorithm in the core code consolidates the regions of all
the devices in a group to return the group view.
To allow for devices to be hotplugged into the group iommufd would re-load
the entire group's reserved regions for each device, just in case they
changed.
Further iommufd already has to deal with duplicated/overlapping reserved
regions as it must union all the groups together.
Thus simplify all of this to just use the device reserved regions
interface directly from the iommu driver.
Link: https://lore.kernel.org/r/5-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The devices list was used as a simple way to avoid having per-group
information. Now that this seems to be unavoidable, just commit to
per-group information fully and remove the devices list from the HWPT.
The iommufd_group stores the currently assigned HWPT for the entire group
and we can manage the per-device attach/detach with a list in the
iommufd_group.
For destruction the flow is organized to make the following patches
easier, the actual call to iommufd_object_destroy_user() is done at the
top of the call chain without holding any locks. The HWPT to be destroyed
is returned out from the locked region to make this possible. Later
patches create locking that requires this.
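The shape of that destruction flow can be sketched roughly as below. This is
a simplified userspace model, not the kernel code: the "lock" is only a flag
so the example can assert that destruction runs unlocked, and every name in
it (ex_group, ex_hwpt, ex_group_detach() and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ex_hwpt {
	bool destroyed;
};

struct ex_group {
	bool locked;		/* stand-in for a real mutex */
	unsigned int attached;	/* devices currently attached */
	struct ex_hwpt *hwpt;	/* HWPT assigned to the whole group */
};

/*
 * Detach one device while holding the lock. If it was the last device, the
 * HWPT is unlinked and handed back to the caller instead of being destroyed
 * here.
 */
static struct ex_hwpt *ex_group_detach(struct ex_group *g)
{
	struct ex_hwpt *to_destroy = NULL;

	g->locked = true;
	if (g->attached && --g->attached == 0) {
		to_destroy = g->hwpt;
		g->hwpt = NULL;
	}
	g->locked = false;
	return to_destroy;
}

/* Destruction must never run with the group lock held. */
static void ex_hwpt_destroy(struct ex_group *g, struct ex_hwpt *h)
{
	assert(!g->locked);
	h->destroyed = true;
}

/* Top of the call chain: destroy only after the locked region has ended. */
static void ex_device_detach(struct ex_group *g)
{
	struct ex_hwpt *h = ex_group_detach(g);

	if (h)
		ex_hwpt_destroy(g, h);
}
```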
Link: https://lore.kernel.org/r/3-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
When the hwpt to device attachment is fairly static we could get away with
the simple approach of keeping track of the groups via a device list. But
with replace this is infeasible.
Add an automatically managed struct that is 1:1 with the iommu_group
per-ictx so we can store the necessary tracking information there.
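One way to picture an "automatically managed" per-context group object is a
get-or-create lookup with a reference count, freed on the last put. The
sketch below assumes a small fixed-size table and hypothetical names
(ex_ictx, ex_igroup, ex_igroup_get(), ex_igroup_put()); the kernel's actual
data structures and lifetime rules may well differ.

```c
#include <stddef.h>
#include <stdlib.h>

#define EX_MAX_GROUPS 8

struct ex_igroup {
	int group_id;		/* identifies the underlying iommu_group */
	unsigned int refs;	/* one reference per bound device */
};

struct ex_ictx {
	struct ex_igroup *groups[EX_MAX_GROUPS];
};

/* Return the existing object for group_id, or create it on first use. */
static struct ex_igroup *ex_igroup_get(struct ex_ictx *c, int group_id)
{
	size_t i, free_slot = EX_MAX_GROUPS;
	struct ex_igroup *g;

	for (i = 0; i < EX_MAX_GROUPS; i++) {
		g = c->groups[i];
		if (g && g->group_id == group_id) {
			g->refs++;
			return g;
		}
		if (!g && free_slot == EX_MAX_GROUPS)
			free_slot = i;
	}
	if (free_slot == EX_MAX_GROUPS)
		return NULL;
	g = calloc(1, sizeof(*g));
	if (!g)
		return NULL;
	g->group_id = group_id;
	g->refs = 1;
	c->groups[free_slot] = g;
	return g;
}

/* Drop one reference; the object disappears with its last user. */
static void ex_igroup_put(struct ex_ictx *c, struct ex_igroup *g)
{
	size_t i;

	if (--g->refs)
		return;
	for (i = 0; i < EX_MAX_GROUPS; i++)
		if (c->groups[i] == g)
			c->groups[i] = NULL;
	free(g);
}
```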
Link: https://lore.kernel.org/r/2-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
With the recent rework this no longer needs to be done at domain
attachment time; we know if the device is usable by iommufd when we bind
it.
The value of msi_device_has_isolated_msi() is not allowed to change while
a driver is bound.
Link: https://lore.kernel.org/r/1-v8-6659224517ea+532-iommufd_alloc_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
vfio_group is not needed for vfio device cdev, so with vfio device cdev
introduced, the vfio_group infrastructures can be compiled out if only
cdev is needed.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718135551.6592-26-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It's common to get a reference to the iommufd context from a given file
descriptor, so add an API for it. Existing users of this API are compiled
only when IOMMUFD is enabled, so there is no need for a stub for the
IOMMUFD disabled case.
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718135551.6592-21-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Previously, the detach routine was only done by destroy(), which was
called by vfio_iommufd_emulated_unbind() when the device runs close(). All
the mappings in iopt had already been cleaned up by the time the call trace
reached this detach() routine.
Now there is a need for a detach uAPI, which requires not only a new
iommufd_access_detach() API but also an access->ops->unmap() call as
cleanup. So add one.
However, leaving that unprotected can introduce a race condition during
the pin_/unpin_pages() call, where access->ioas->iopt is referenced. So
add an ioas_lock to protect the references to iopt.
Also, to allow the iommufd_access_unpin_pages() callback to happen via
this unmap() call, add an ioas_unpin pointer, so the unpin routine won't
be affected by the "access->ioas = NULL" trick.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718135551.6592-15-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This is needed by the vfio-pci driver to report affected devices in the
hot-reset for a given device.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-6-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add a helper to check if any device within the given iommu_group has been
bound to the iommufd_ctx. This is useful for checking the ownership of
devices which have not been bound themselves but cannot be bound to any
other iommufd_ctx because their iommu_group has already been bound.
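The check described here amounts to a scan over the devices bound to a
context, looking for a matching group. The sketch below shows that idea in
standalone C under the assumption that bound devices are kept in a flat
array; ex_bound_dev and ex_ctx_has_group() are hypothetical names and not
the kernel helper itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one device bound to some context. */
struct ex_bound_dev {
	const void *ctx;	/* context the device is bound to */
	int group_id;		/* iommu_group the device belongs to */
};

/*
 * Return true if at least one device from group_id is bound to ctx. A device
 * in that group which is not bound itself can then still be treated as owned
 * by ctx, since the group cannot be bound to a different context.
 */
static bool ex_ctx_has_group(const struct ex_bound_dev *devs, size_t n,
			     const void *ctx, int group_id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (devs[i].ctx == ctx && devs[i].group_id == group_id)
			return true;
	return false;
}
```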
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-5-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With this reservation, IOMMUFD users can encode negative IDs for specific
purposes, e.g. VFIO needs two reserved values to tell userspace that the
ID returned is not valid but has another meaning.
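The idea can be sketched as follows: if object IDs are never allocated above
INT32_MAX, every value that reads as negative in a signed 32-bit field is
free to carry a special meaning. The sentinel names and values below are
hypothetical and chosen only for illustration, as is the allocator; they are
not the constants used by VFIO or iommufd.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinels; real names and numbers may differ. */
#define EX_ID_NOT_OWNED		(-1)
#define EX_ID_NO_OBJECT		(-2)

/* Hypothetical allocator state: the most recently allocated ID. */
struct ex_id_alloc {
	uint32_t last;
};

/*
 * Hand out the next object ID without ever exceeding INT32_MAX, so the value
 * stays positive when read as a signed 32-bit integer. Returns 0 once the
 * space is exhausted; 0 is not a valid object ID in this sketch.
 */
static int32_t ex_id_alloc_next(struct ex_id_alloc *a)
{
	if (a->last >= (uint32_t)INT32_MAX)
		return 0;
	return (int32_t)++a->last;
}

/* Anything that is not strictly positive is a sentinel, not an object. */
static bool ex_id_is_object(int32_t id)
{
	return id > 0;
}
```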
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20230718105542.4138-4-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Just two RC syzkaller fixes, both for the same basic issue, using the area
pointer during an access forced unmap while the locks protecting it were
let go.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZJmGygAKCRCFwuHvBreF
YVNSAQC7SgejTvwD6EYXr8AUDko1v0G0M/o60OrWIuC7xWiFPQD/RDwtItRLzf4h
i+YCfMtn/7IB/uV/sRTF4m0HzudcDAM=
=0fm4
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Just two syzkaller fixes, both for the same basic issue: using the
area pointer during an access forced unmap while the locks protecting
it were let go"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd: Call iopt_area_contig_done() under the lock
iommufd: Do not access the area pointer after unlocking
No invocation of pin_user_pages_remote() uses the vmas parameter, so
remove it. This forms part of a larger patch set eliminating the use of
the vmas parameters altogether.
Link: https://lkml.kernel.org/r/28f000beb81e45bf538a2aaa77c90f5482b67a32.1684350871.git.lstoakes@gmail.com
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The following selftest patch requires both the bug fixes and the
improvements of the selftest framework.
* iommufd/for-rc:
iommufd: Do not corrupt the pfn list when doing batch carry
iommufd: Fix unpinning of pages when an access is present
iommufd: Check for uptr overflow
Linux 6.3-rc5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
smatch reports:
drivers/iommu/iommufd/selftest.c:295:21: warning: symbol
'mock_iommu_device' was not declared. Should it be static?
This variable is only used in one file so it should be static.
Fixes: 65c619ae06 ("iommufd/selftest: Make selftest create a more complete mock device")
Link: https://lore.kernel.org/r/20230404002317.1912530-1-trix@redhat.com
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
syzkaller found that the calculation of batch_last_index should use
'start_index' since at input to this function the batch is either empty or
it has already been adjusted to cross any accesses so it will start at the
point we are unmapping from.
Getting this wrong causes the unmap to run over the end of the pages
which corrupts pages that were never mapped. In most cases this triggers
the num pinned debugging:
WARNING: CPU: 0 PID: 557 at drivers/iommu/iommufd/pages.c:294 __iopt_area_unfill_domain+0x152/0x560
Modules linked in:
CPU: 0 PID: 557 Comm: repro Not tainted 6.3.0-rc2-eeac8ede1755 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__iopt_area_unfill_domain+0x152/0x560
Code: d2 0f ff 44 8b 64 24 54 48 8b 44 24 48 31 ff 44 89 e6 48 89 44 24 38 e8 fc d3 0f ff 45 85 e4 0f 85 eb 01 00 00 e8 0e d2 0f ff <0f> 0b e8 07 d2 0f ff 48 8b 44 24 38 89 5c 24 58 89 18 8b 44 24 54
RSP: 0018:ffffc9000108baf0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff821e3f85
RDX: 0000000000000000 RSI: ffff88800faf0000 RDI: 0000000000000002
RBP: ffffc9000108bd18 R08: 000000000003ca25 R09: 0000000000000014
R10: 000000000003ca00 R11: 0000000000000024 R12: 0000000000000004
R13: 0000000000000801 R14: 00000000000007ff R15: 0000000000000800
FS: 00007f3499ce1740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000243 CR3: 00000000179c2001 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
iopt_area_unfill_domain+0x32/0x40
iopt_table_remove_domain+0x23f/0x4c0
iommufd_device_selftest_detach+0x3a/0x90
iommufd_selftest_destroy+0x55/0x70
iommufd_object_destroy_user+0xce/0x130
iommufd_destroy+0xa2/0xc0
iommufd_fops_ioctl+0x206/0x330
__x64_sys_ioctl+0x10e/0x160
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
Also add some useful WARN_ON sanity checks.
Cc: <stable@vger.kernel.org>
Fixes: 8d160cd4d5 ("iommufd: Algorithms for PFN storage")
Link: https://lore.kernel.org/r/2-v1-ceab6a4d7d7a+94-iommufd_syz_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Yi Liu says
===================
The .bind_iommufd op of vfio emulated devices is either empty or does
nothing. This differs from the vfio physical devices; to add the vfio
device cdev, they need to act the same.
This series first makes the .bind_iommufd op of vfio emulated devices
create an iommufd_access, which introduces a new iommufd API. Then it lets
drivers that do not provide a .bind_iommufd op use the vfio emulated
iommufd op set. This makes all vfio device drivers have consistent iommufd
operations, which is good for adding new device uAPIs in the device cdev.
===================
* branch 'vfio_mdev_ops':
vfio: Check the presence for iommufd callbacks in __vfio_register_dev()
vfio/mdev: Uses the vfio emulated iommufd ops set in the mdev sample drivers
vfio-iommufd: Make vfio_iommufd_emulated_bind() return iommufd_access ID
vfio-iommufd: No need to record iommufd_ctx in vfio_device
iommufd: Create access in vfio_iommufd_emulated_bind()
iommu/iommufd: Pass iommufd_ctx pointer in iommufd_get_ioas()
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
vfio device cdev needs to return iommufd_access ID to userspace if
bind_iommufd succeeds.
Link: https://lore.kernel.org/r/20230327093351.44505-5-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
There is a need to create an iommufd_access before having an IOAS, and to
set the IOAS later. For example, the vfio device cdev needs an iommufd
object to represent the bond (iommufd_access) and to support IOAS
replacement.
Move the iommufd_access_create() call into vfio_iommufd_emulated_bind(),
making it symmetric with the __vfio_iommufd_access_destroy() call in the
vfio_iommufd_emulated_unbind(). This means an access is created/destroyed
by the bind()/unbind(), and the vfio_iommufd_emulated_attach_ioas() only
updates the access->ioas pointer.
Since vfio_iommufd_emulated_bind() does not provide ioas_id, drop it from
the argument list of iommufd_access_create(). Instead, add a new access
API iommufd_access_attach() to set the access->ioas pointer. Also, set
vdev->iommufd_attached accordingly, similar to the physical pathway.
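The resulting split between creating an access and attaching it to an IOAS
can be modelled roughly as below. This is a userspace sketch with
hypothetical names (ex_access, ex_access_create(), ex_access_attach(),
ex_access_pin()); the error codes are assumptions made for the example and
are not taken from the kernel implementation.

```c
#include <errno.h>
#include <stddef.h>

struct ex_ioas {
	int id;
};

struct ex_access {
	struct ex_ioas *ioas;	/* NULL until attached */
};

/* bind() time: the access exists but has no IOAS yet. */
static void ex_access_create(struct ex_access *a)
{
	a->ioas = NULL;
}

/* attach time: only the ioas pointer is updated. */
static int ex_access_attach(struct ex_access *a, struct ex_ioas *ioas)
{
	if (!ioas)
		return -EINVAL;
	if (a->ioas)
		return -EBUSY;	/* assumption: one IOAS at a time */
	a->ioas = ioas;
	return 0;
}

/* Operations that need the IOAS must fail until an attach has happened. */
static int ex_access_pin(const struct ex_access *a)
{
	return a->ioas ? 0 : -ENOENT;
}
```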
Link: https://lore.kernel.org/r/20230327093351.44505-3-yi.l.liu@intel.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
No need to pass the iommufd_ucmd pointer.
Link: https://lore.kernel.org/r/20230327093351.44505-2-yi.l.liu@intel.com
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
iommufd wants to use more infrastructure, like the iommu_group, that the
mock device does not support. Create a more complete mock device that can
go through the whole cycle of ownership, blocking domain, and has an
iommu_group.
This requires creating a real struct device on a real bus to be able to
connect it to an iommu_group. Unfortunately we cannot formally attach the
mock iommu driver as an actual driver as the iommu core does not allow
more than one driver or provide a general way for busses to link to
iommus. This can be solved with a little hack to open code the dev_iommus
struct.
With this infrastructure things work exactly the same as the normal domain
path, including the auto domains mechanism and direct attach of hwpts. As
the created hwpt is now an autodomain it is no longer required to destroy
it and trying to do so will trigger a failure.
Link: https://lore.kernel.org/r/11-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
It is too confusing now that we have the 'dev_id' as part of the main
interface. Make it clear this is the special selftest device object. This
object is analogous to the VFIO device FD.
Link: https://lore.kernel.org/r/7-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The HWPT is always linked to an IOAS and once a HWPT exists its domain
should be fully mapped. This ended up being split up into device.c during
a two phase creation that was a bit confusing.
Move the iopt_table_add_domain() into iommufd_hw_pagetable_alloc() by
having it call back to device.c to complete the domain attach in the
required order.
Calling iommufd_hw_pagetable_alloc() with immediate_attach = false will
work on most drivers, but notably the SMMU drivers will fail because they
can't decide what kind of domain to create until they are attached. This
will be fixed when the domain_alloc function can take in a struct device.
Link: https://lore.kernel.org/r/6-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A HWPT is permanently associated with an IOAS when it is created, remove
the strange situation where a refcount != 0 HWPT can have been
disconnected from the IOAS by putting all the IOAS related destruction in
the object destroy function.
Initializing a HWPT takes two stages: we have to allocate it, then attach
it to a device and populate the domain. Once the domain is populated it is
fully linked to the IOAS.
Arrange things so that all the error unwinds flow through the
iommufd_hw_pagetable_destroy() and allow it to handle all cases.
Link: https://lore.kernel.org/r/4-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
This should be added immediately after every iopt_table_add_domain(), and
deleted after every iopt_table_remove_domain() under the ioas->mutex.
Tidy things to be consistent.
Link: https://lore.kernel.org/r/3-v3-ae9c2975a131+2e1e8-iommufd_hwpt_jgg@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>