Rename vram_scratch into mem_scratch and allow allocating it into GTT as
well.
The only problem with that is that we won't have a default page for the
system aperture any more.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Technically all of those can use GTT as well, no need to force things
into VRAM.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
1.Drop soft reset check when do should recover gpu check.
(As it will skip gpu reset operation if some ip is hang but
not support soft reset)
2.Check soft reset status before do soft reset when pre asic reset.
a. If check soft reset return true, it means: some ip is hang and
it also support soft reset, will try soft reset first.
b. If check soft reset return false, it means:
I. All the ip are not hang, will skip gpu reset.
II. Some ip is hang but not support soft reset, will skip soft
reset and retry with full reset later.
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It's also part of gfxoff.
Cc: stable@vger.kernel.org # 6.0, 6.1
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It's also part of gfxoff.
Cc: stable@vger.kernel.org # 6.0, 6.1
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit f543d28687.
This is no longer needed since we no longer touch SDMA 5.x for s0i3.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
SDMA 5.x is part of the GFX block so it's controlled via
GFXOFF. Skip suspend as it should be handled the same
as GFX.
v2: drop SDMA 4.x. That requires special handling.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
It's handled by GFXOFF for SDMA 5.x and SMU saves the state on
SDMA 4.x.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- evict_resource is taking too long causing sriov full access mode timeout.
So, add an extra evict_resource in the beginning as an early evict.
Signed-off-by: Shikang Fan <shikang.fan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Instead of a single worker going over the list of delete BOs in regular
intervals use a per BO worker which blocks for the resv object and
locking of the BO.
This not only simplifies the handling massively, but also results in
much better response time when cleaning up buffers.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221125102137.1801-3-christian.koenig@amd.com
Those functions never worked correctly since it is still perfectly
possible that a buffer object is released and the background worker
restarted even after calling them.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221125102137.1801-2-christian.koenig@amd.com
In amdgpu_get_xgmi_hive(), we should not call kfree() after
kobject_put() as the PUT will call kfree().
In amdgpu_device_ip_init(), we need to check the returned *hive*
which can be NULL before we dereference it.
Signed-off-by: Liang He <windhl@126.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- in device_resume, sriov configure interrupt should be in full access,
so release_full_gpu should be done after kfd_resume.
- remove the previous workaround solution for sriov.
Fixes: ec4927d463 ("drm/amdgpu: fix for suspend/resume sequence under sriov")
Signed-off-by: Shikang Fan <shikang.fan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
As comment of pci_get_domain_bus_and_slot() says, it returns
a pci device with refcount increment, when finish using it,
the caller must decrement the reference count by calling
pci_dev_put().
So before returning from amdgpu_device_resume|suspend_display_audio(),
pci_dev_put() is called to avoid refcount leak.
Fixes: 3f12acc8d6 ("drm/amdgpu: put the audio codec into suspend state before gpu reset V3")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
amd-drm-next-6.2-2022-11-18:
amdgpu:
- SR-IOV fixes
- Clean up DC checks
- DCN 3.2.x fixes
- DCN 3.1.x fixes
- Don't enable degamma on asics which don't support it
- IP discovery fixes
- BACO fixes
- Fix vbios allocation handling when vkms is enabled
- Drop buggy tdr advanced mode GPU reset handling
- Fix the build when DCN is not set in kconfig
- MST DSC fixes
- Userptr fixes
- FRU and RAS EEPROM fixes
- VCN 4.x RAS support
- Aldrebaran CU occupancy reporting fix
- PSP ring cleanup
amdkfd:
- Memory limit fix
- Enable cooperative launch on gfx 10.3
amd-drm-next-6.2-2022-11-11:
amdgpu:
- SMU 13.x updates
- GPUVM TLB race fix
- DCN 3.1.4 updates
- DCN 3.2.x updates
- PSR fixes
- Kerneldoc fix
- Vega10 fan fix
- GPUVM locking fixes in error pathes
- BACO fix for Beige Goby
- EEPROM I2C address cleanup
- GFXOFF fix
- Fix DC memory leak in error pathes
- Flexible array updates
- Mtype fix for GPUVM PTEs
- Move Kconfig into amdgpu directory
- SR-IOV updates
- Fix possible memory leak in CS IOCTL error path
amdkfd:
- Fix possible memory overrun
- CRIU fixes
radeon:
- ACPI ref count fix
- HDA audio notifier support
- Move Kconfig into radeon directory
UAPI:
- Add new GEM_CREATE flags to help to transition more KFD functionality to the DRM UAPI.
These are used internally in the driver to align location based memory coherency
requirements from memory allocated in the KFD with how we manage GPUVM PTEs. They
are currently blocked in the GEM_CREATE IOCTL as we don't have a user right now.
They are just used internally in the kernel driver for now for existing KFD memory
allocations. So a change to the UAPI header, but no functional change in the UAPI.
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221118170807.6505-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
The patch is enabling mode-1 reset for RAS recovery in fatal error mode.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The state of VRAM is unreliable due to a PCI event like AER, link reset
or DPC.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Re-submitting IBs by the kernel has many problems because pre-
requisite state is not automatically re-created as well. In
other words neither binary semaphores nor things like ring
buffer pointers are in the state they should be when the
hardware starts to work on the IBs again.
Additional to that even after more than 5 years of
developing this feature it is still not stable and we have
massively problems getting the reference counts right.
As discussed with user space developers this behavior is not
helpful in the first place. For graphics and multimedia
workloads it makes much more sense to either completely
re-create the context or at least re-submitting the IBs
from userspace.
For compute use cases re-submitting is also not very
helpful since userspace must rely on the accuracy of
the result.
Because of this we stop this practice and instead just
properly note that the fence submission was canceled. The
only use case we keep the re-submission for now is SRIOV
and function level resets.
v2: as suggested by Sshaoyun stop resubmitting jobs even for SRIOV
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit e6c6338f39.
This feature basically re-submits one job after another to
figure out which one was the one causing a hang.
This is obviously incompatible with gang-submit which requires
that multiple jobs run at the same time. It's also absolutely
not helpful to crash the hardware multiple times if a clean
recovery is desired.
For testing and debugging environments we should rather disable
recovery alltogether to be able to inspect the state with a hw
debugger.
Additional to that the sw implementation is clearly buggy and causes
reference count issues for the hardware fence.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Amdgpu_ras_set_error_query_ready is called at the start of
amdgpu_device_gpu_recover to disable query ras error, but the
code behind only enables query ras error in full reset path,
but not in soft reset path, emergency restart path and skip
the hardware reset path.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If we enable virtual display functionality on parts with
no display hardware we can end up trying to check for and
reserve the vbios FB area on devices where it doesn't exist.
Check if display hardware is actually present on the hardware
before trying to reserve the memory.
v2: move the check into common code
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
There are several places where we don't want to check
if a particular asic could support DC, but rather, if
DC is enabled. Set a flag if DC is enabled and check
for that rather than if a device supports DC or not.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
virtual display is enabled unconditionally in SR-IOV, but
without specifying the virtual_display module, the number
of crtcs defaults to 0. Set a single display by default
for SR-IOV if the virtual_display parameter is not set.
Only enable virtual display by default on SR-IOV on asics
which actually have display hardware.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Remove include statements for <drm/drm_fb_helper.h> where it is not
required (i.e., most of them). In a few places include other header
files that are required by the source code.
v3:
* fix amdgpu include statements
* fix rockchip include statements
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221103151446.2638-23-tzimmermann@suse.de
- clear kiq ring after suspend/resume under sriov to aviod kiq ring
test failure
- update irq after resume to fix kiq interrput loss
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[why]
Start from soc21, CP does not support MCBP, so disable it.
[how]
Used amgpu_mcbp flag alone instead of checking if is in SRIOV to
enable/disable MCBP.
Only set flag to enable on asic_type prior to soc21 in SRIOV.
Signed-off-by: Yiqing Yao <yiqing.yao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If a system does not have swap and memory is under 100% usage,
amdgpu will fail to evict resources. Currently the suspend
carries on proceeding to reset the GPU:
```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```
At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored. However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.
```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)
[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```
To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails. This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.
Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In the S2idle suspend/resume phase the gfxoff is keeping functional so
some IP blocks will be likely to reinitialize at gfxoff entry and that
will result in failing to program GC registers.Therefore, let disallow
gfxoff until AMDGPU IPs reinitialized completely.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Suggested by PMFW team and same as what did for gfxoff feature.
This can address some Mode1Reset failures observed on SMU13.0.0.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This reverts commit dac6b80818.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.
Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Suggested by PMFW team and same as what did for gfxoff feature.
This can address some Mode1Reset failures observed on SMU13.0.0.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x
This reverts commit dac6b80818.
This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.
Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor
during the suspend time. Furthermore, we cannot request a
mode 1 reset under SRIOV as VF. Therefore, we will skip it
as it is called in suspend_noirq() function.
- In the resume code path, we need to send REQ_GPU_INIT to the
hypervisor and also resume PSP IP block under SRIOV.
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use the simpified API that calculates distance between two devices.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Disable verbose while getting p2p distance. With verbose, it shows
warning if ACS redirect is set between the devices. Adds noise
to dmesg logs when a few GPU devices are on the same platform.
Example log:
amdgpu 0000:34:00.0: ACS redirect is set between the client and provider (0000:31:00.0)
amdgpu 0000:34:00.0: to disable ACS redirect for this path, add the kernel parameter:
pci=disable_acs_redir=0000:30:00.0;0000:2e:00.0;0000:33:00.0;0000:2e:10.0
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Allows submitting jobs as gang which needs to run on multiple
engines at the same time.
Basic idea is that we have a global gang submit fence representing when the
gang leader is finally pushed to run on the hardware last.
Jobs submitted as gang are never re-submitted in case of a GPU reset since this
won't work and will just deadlock the hardware immediately again.
v2: fix logic inversion, improve documentation, fix rcu
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
V3:
Fixed psp fence and memory issues for the asic
using smu v13_0_2 when removing amdgpu device.
[Why]:
1. psp_suspend->psp_free_shared_bufs->
psp_ta_free_shared_buf->
amdgpu_bo_free_kernel->
...->amdgpu_bo_release_notify->
amdgpu_fill_buffer
psp will free vram memory used by psp when psp_suspend
is called. But for the asic using smu v13_0_2, because
psp_suspend is called before adev->shutdown is set to
true when removing the first hive device, amdgpu fill_buffer
will be called, which will cause fence issues when evicting
all vram resources in amdgpu vram mgr_fini.
2. Since psp_hw_fini is not called after calling psp_suspend
and psp_suspend only calls psp_ring_stop, the psp ring memory
will not be released when amdgpu device is removed.
[How]:
1. Set shutdown to true before calling amdgpu_device_gpu_recover,
then amdgpu_fill_buffer will not be called when psp_suspend is
called.
2. Free psp ring memory in psp_sw_fini.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Adjust removal control flow for smu v13_0_2:
During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.
V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
a global variable.
V3:
1. Update commit comments.
2. Split a patch into multiple patches.
3. The current patch does:
a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into
the existing gpu recover path, which make all devices
in hive list only have HW reset but no resume (except
the base IP).
b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and
AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover
in amdgpu_pci_remove when removing the first device in
hive list.
c. When removing the first device, the IP blocks keyword
function call sequence is as follows:
.suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini.
^ |
|-<----------<---------<----|
The first three sequences are because of a call to
amdgpu_device_gpu_recover. The three sequences will be
executed in a loop until all devices in the hive list
are iterated.
The sequences starting from .hw_fini only apply to the
first device. Since .suspend has been called before,
except the resumed phase1 basic ip blocks, all other ip
blocks .hw_fini of current device will do nothing.
d. When removing other devices, the calling sequences is the
same as legacy:
.hw_fini -> .early_fini -> .sw_fini.
Since .suspend has been called when removing the first device,
except the resumed phase1 basic ip blocks, all of other ip
blocks .hw_fini of current device will do nothing.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Move common IP init before GMC init so that HDP gets
remapped before GMC init which uses it.
This fixes the Unsupported Request error reported through
AER during driver load. The error happens as a write happens
to the remap offset before real remapping is done.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373
The error was unnoticed before and got visible because of the commit
referenced below. This doesn't fix anything in the commit below, rather
fixes the issue in amdgpu exposed by the commit. The reference is only
to associate this commit with below one so that both go together.
Fixes: 8795e182b0 ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
both get_xgmi_hive and put_xgmi_hive can be skipped since the
reset domain is not necessary for VF
Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For SRIOV configuration, host driver control the reset method(either FLR or
heavier chain reset). The host will notify the guest individually with FLR
message if individual GPU within the hive need to be reset. So for guest
side, no need to use hive->reset_domain to replace the original per
device reset_domain
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
V1:
The psp_cmd_submit_buf function is called by psp_hw_fini to send
TA unload messages to psp to terminate ras, asd and tmr. But when
amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
drm_dev_unplug
......
amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
->.hw_fini->psp_hw_fini->...
->psp_ta_unload->psp_cmd_submit_buf
......
}
The program will return when calling drm_dev_enter in psp_cmd_submit_buf.
So the call to drm_dev_enter in psp_cmd_submit_buf should be
removed, so that the TA unload messages can be sent to the psp
when amdgpu is uninstalled.
V2:
1. Restore psp_cmd_submit_buf to its original code.
2. Move drm_dev_unplug call after amdgpu_driver_unload_kms in
amdgpu_pci_remove.
3. Since amdgpu_device_fini_hw is called by amdgpu_driver_unload_kms,
remove the unplug check to release device mmio resource in
amdgpu_device_fini_hw before calling drm_dev_unplug.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[Why] Devices with CPU XGMI iolink do not support PCIe peer access.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The additional call is caused by merge conflict
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Only amdgpu_get_xgmi_hive but no amdgpu_put_xgmi_hive
which will leak the hive reference.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add debugfs interface to log GFXOFF statistics:
- Read amdgpu_gfxoff_count to get the total GFXOFF entry count at the
time of query since system power-up
- Write 1 to amdgpu_gfxoff_residency to start logging, and 0 to stop.
Read it to get average GFXOFF residency % multiplied by 100
during the last logging interval.
Both features are designed to be keep the values persistent between
suspends.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For some hang caused by slow tests, engine cannot be stopped which
may cause resume failure after reset. In this case, force halt
engine by reverting context addresses
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed
v2: move this part out from the asic specific part
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
A list of devices to be reset is already created in
amdgpu_device_gpu_recover function. Creating another list with the
same nodes is incorrect and not supported in list_head. Instead, pass
the device list as part of reset context.
Fixes: 9e08564727 (drm/amdgpu: Refactor mode2 reset logic for v13.0.2)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
mes self test rely on vm mapping, move it after
drm sched re-started so that vm mapping can work
during gpu reset.
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-and-tested-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This extra call trace dump comes out in every gpu reset.
And it gives people a wrong impression that something
went wrong. Although actually there was not.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Fixes this issue:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5094: warning: expecting prototype for amdgpu_device_gpu_recover_imp(). Prototype was for amdgpu_device_gpu_recover() instead
Fixes: cf72704414 ("drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover")
Reviewed-by: Kent Russell <kent.russell@amd.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Change amdggpu to amdgpu and pedning to pending
Signed-off-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by increasing refcount each
time emit is called so amdgpu code doesn't need to make workarounds
using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.
Also since in the previous patch we resumed setting s_fence->parent to NULL
in drm_sched_stop switch to directly checking if job->hw_fence is
signaled to short circuit reset if already signed.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Tested-by: Yiqing Yao <yiqing.yao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Problem:
After we start handling timed out jobs we assume there fences won't be
signaled but we cannot be sure and sometimes they fire late. We need
to prevent concurrent accesses to fence array from
amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process
from a late EOP interrupt.
Fix:
Before accessing fence array in GPU disable EOP interrupt and flush
all pending interrupt handlers for amdgpu device's interrupt line.
v2: Switch from irq_get/put to full enable/disable_irq for amdgpu
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use the correct adev variable for the drm_fb_helper in
amdgpu_device_gpu_recover(). Noticed by inspection.
Fixes: 087451f372 ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
enable_mes and enable_mes_kiq are set in both device init and
MES IP init. Leave the ones in MES IP init, since it is
a more accurate way to judge from GC IP version.
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We skip rest requests if another one is already in progress.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We need to have a work_struct to cancel this reset if another
already in progress.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Will be read by executors of async reset like debugfs.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add support for peer-to-peer communication among AMD GPUs over PCIe
bus. Support REQUIRES enablement of config HSA_AMD_P2P.
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Allocate memory for register value and use the same values for devcoredump.
v1 -> v2: Change krealloc_array() to kmalloc_array()
v2 -> v3: Fix alignment
Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Reviewed-by: Shashank Sharma <Shashank.sharma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
LVDS support was implemented in DC a while ago. Just DAC
support is left to do.
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Drop all of the extra cases in the default case.
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Removed trailing whitespace from end of line in amdgpu_device.c
Signed-off-by: Mitchell Augustin <kernel@mitchellaugustin.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To enable TMZ feature based on IP version needs adev->ip_version
populated but its empty. Move amdgpu_gmc_tmz_set to a place where
ip_version is populated.
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
support umc/gfx/sdma ras on guest side
Changed from V1:
move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
[why]
lru_list not empty warning in sw fini during repeated device bind unbind.
There should be a amdgpu_fence_wait_empty() before the flush_delayed_work()
call as Christian suggested.
[how]
Move to do flush_delayed_work for ttm bo delayed delete wq after fence_driver_hw_fini.
Tested by: Yiqing Yao <yiqing.yao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Yiqing Yao <yiqing.yao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
For kfd hasn't supported mes, skip kfd routines.
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
mes_kiq parameter is used to enable mes kiq pipe.
This module parameter is unneccessary or enabled by default
in final version.
v2: reword commit message.
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use the whole doorbell space for mes. Each queue in one process occupies
one doorbell slot to ring the queue submitting.
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Some initial settings now are not available from
the atom data table. The assumption that !ps[0]
|| !ps[1] in amdgpu_atom_asic_init is not valid.
In addition, driver needs to strictly follow
atomfirmware structure (asic_init_parameters) to
initialize parameters used to execute asic_init
function, otherwise, the execution of asic_init
would fail.
This shall be applicable to all soc15 adapters,but
let make the transition on soc21 first.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This data has no dependencies, so encapsulate it all within
amdgpu_discovery.c.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
- In the latest version of the header, there is a variable name change.
This should not cause any backward compatibility since the variable is
at the same offset in the struct.
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
With this, we can support more CG flags.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
If the GPU is passed through to a guest VM, use the PCI
BAR for CPU FB access rather than the physical address of
carve out. The physical address is not valid in a guest.
v2: Fix HDP handing as suggested by Michel
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Michel Dänzer <mdaenzer@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
We don't support runtime pm on APUs. They support more
dynamic power savings using clock and powergating.
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
These were leftover from bring up and are no longer
necessary. The information is available via
the IP discovery table.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
otherwise adev->ip_versions is still empty when amdgpu_gmc_noretry_set
is called.
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
According to my investigation of the state of PCI
reset recently it's not working. The reason is
due to the fact the kernel PCI code rejects SBR
when there are more then one PF under same bridge
which we always have (at least AUDIO PF but usually
more) and that because SBR will reset all the PFS
and devices under the same bridge as you and you
cannot assume they support SBR.
Once we anble FLR support we can reenable this option as
FLR is doable on single PF and doens't have this
restriction.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
UAPI Changes:
Cross-subsystem Changes:
- Split out panel-lvds and lvds dt bindings .
- Put yes/no on/off disabled/enabled strings in linux/string_helpers.h
and use it in drivers and tomoyo.
- Clarify dma_fence_chain and dma_fence_array should never include eachother.
- Flatten chains in syncobj's.
- Don't double add in fbdev/defio when page is already enlisted.
- Don't sort deferred-I/O pages by default in fbdev.
Core Changes:
- Fix missing pm_runtime_put_sync in bridge.
- Set modifier support to only linear fb modifier if drivers don't
advertise support.
- As a result, we remove allow_fb_modifiers.
- Add missing clear for EDID Deep Color Modes in drm_reset_display_info.
- Assorted documentation updates.
- Warn once in drm_clflush if there is no arch support.
- Add missing select for dp helper in drm_panel_edp.
- Assorted small fixes.
- Improve fb-helper's clipping handling.
- Don't dump shmem mmaps in a core dump.
- Add accounting to ttm resource manager, and use it in amdgpu.
- Allow querying the detected eDP panel through debugfs.
- Add helpers for xrgb8888 to 8 and 1 bits gray.
- Improve drm's buddy allocator.
- Add selftests for the buddy allocator.
Driver Changes:
- Add support for nomodeset to a lot of drm drivers.
- Use drm_module_*_driver in a lot of drm drivers.
- Assorted small fixes to bridge/lt9611, v3d, vc4, vmwgfx, mxsfb, nouveau,
bridge/dw-hdmi, panfrost, lima, ingenic, sprd, bridge/anx7625, ti-sn65dsi86.
- Add bridge/it6505.
- Create DP and DVI-I connectors in ast.
- Assorted nouveau backlight fixes.
- Rework amdgpu reset handling.
- Add dt bindings for ingenic,jz4780-dw-hdmi.
- Support reading edid through aux channel in ingenic.
- Add a drm driver for Solomon SSD130x OLED displays.
- Add simple support for sharp LQ140M1JW46.
- Add more panels to nt35560.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEuXvWqAysSYEJGuVH/lWMcqZwE8MFAmIWLSEACgkQ/lWMcqZw
E8OP7hAAjix94EX5fhFa7OAdqUbFtsiKhK/4zNtV9FWpFiEsDBz+dlbfDQWIx5an
FIiiiQtSfWjpDv6pcMhoNf80w+dDbc/Cuauz6nNGO7Pkaerh2D/EPG74FD7f7nE3
EIScVs1heYtzM9usKrFKupNYgIdgZxmeovClWuE0OTjLOes2PGvvxXK6iJqNqJMX
VlDO5SR7GRqsDUDV6vmwl63uKL77xJXAahAXIx+BQ/1xrtEhlu6NwsgHIsmPmMSN
YluX34zc1xD/6/uUqvEdp7u46/5/He1c5Q/ia1WV3wRxsO/eMZ+axXqCZP3XGZdt
rMdGNtj1MWKkudYiowStWkCVSG/0fXJCFIAhvRmeZy+YqPdVlqZ2W7g4H1l9iJoo
UVfT9cHrKoxHsukvIEckC5Ov9v1yr39Bd4wUuqaUTUSxY8VID5vjY63TsXl9Zke1
SluTFe9qybbnRNz/hYRvwIS1eT8HvUauAfAhypGTLI5DYHTD7PawcfMJkNzCtJm4
Ta4SC3rTpkpN+7oc8SoNgqRHQ8U9KL5oksP0wVa8vwHsMptSd3X4pUljc6TcfjLv
GEo41D5AuJz3HRVcn9yqPbLoPE2FFB7bfwIMH77yNnoos4Izy/LGhKpN0YdImmI5
W5XVFB0jltGSIhkzLe1mFpLrdJwdUTSUVeCK4H5PhZZQEHLkVtg=
=HuwD
-----END PGP SIGNATURE-----
Merge tag 'drm-misc-next-2022-02-23' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v5.18:
UAPI Changes:
Cross-subsystem Changes:
- Split out panel-lvds and lvds dt bindings .
- Put yes/no on/off disabled/enabled strings in linux/string_helpers.h
and use it in drivers and tomoyo.
- Clarify dma_fence_chain and dma_fence_array should never include eachother.
- Flatten chains in syncobj's.
- Don't double add in fbdev/defio when page is already enlisted.
- Don't sort deferred-I/O pages by default in fbdev.
Core Changes:
- Fix missing pm_runtime_put_sync in bridge.
- Set modifier support to only linear fb modifier if drivers don't
advertise support.
- As a result, we remove allow_fb_modifiers.
- Add missing clear for EDID Deep Color Modes in drm_reset_display_info.
- Assorted documentation updates.
- Warn once in drm_clflush if there is no arch support.
- Add missing select for dp helper in drm_panel_edp.
- Assorted small fixes.
- Improve fb-helper's clipping handling.
- Don't dump shmem mmaps in a core dump.
- Add accounting to ttm resource manager, and use it in amdgpu.
- Allow querying the detected eDP panel through debugfs.
- Add helpers for xrgb8888 to 8 and 1 bits gray.
- Improve drm's buddy allocator.
- Add selftests for the buddy allocator.
Driver Changes:
- Add support for nomodeset to a lot of drm drivers.
- Use drm_module_*_driver in a lot of drm drivers.
- Assorted small fixes to bridge/lt9611, v3d, vc4, vmwgfx, mxsfb, nouveau,
bridge/dw-hdmi, panfrost, lima, ingenic, sprd, bridge/anx7625, ti-sn65dsi86.
- Add bridge/it6505.
- Create DP and DVI-I connectors in ast.
- Assorted nouveau backlight fixes.
- Rework amdgpu reset handling.
- Add dt bindings for ingenic,jz4780-dw-hdmi.
- Support reading edid through aux channel in ingenic.
- Add a drm driver for Solomon SSD130x OLED displays.
- Add simple support for sharp LQ140M1JW46.
- Add more panels to nt35560.
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/686ec871-e77f-c230-22e5-9e3bb80f064a@linux.intel.com
Dump the list of register values to trace event on GPU reset.
Signed-off-by: Somalapuram Amaranath <Amaranath.Somalapuram@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This test is not particularly useful now that GTT and GART
are decoupled in the driver.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Now that we expose the benchmarks via debugfs, there is no
longer a need for the module parameter.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
To avoid multiple runs in parallel to avoid mixing results.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Add device pointer so scheduler's printing can use
DRM_DEV_ERROR() instead, which makes life easier under multiple GPU
scenario.
v2: amend all calls of drm_sched_init()
v3: fill dev pointer for all drm_sched_init() calls
Signed-off-by: Jiawei Gu <Jiawei.Gu@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220221095705.5290-1-Jiawei.Gu@amd.com
Evaluating `pcie_aspm_enabled` as part of driver probe has the implication
that if one PCIe bridge with an AMD GPU connected doesn't support ASPM
then none of them do. This is an invalid assumption as the PCIe core will
configure ASPM for individual PCIe bridges.
Create a new helper function that can be called by individual dGPUs to
react to the `amdgpu_aspm` module parameter without having negative results
for other dGPUs on the PCIe bus.
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>