Commit Graph

327 Commits

Author SHA1 Message Date
Linus Torvalds
8d0749b4f8 drm for 5.17-rc1
core:
 - add privacy screen support
 - move nomodeset option into drm subsystem
 - clean up nomodeset handling in drivers
 - make drm_irq.c legacy
 - fix stack_depot name conflicts
 - remove DMA_BUF_SET_NAME ioctl restrictions
 - sysfs: send hotplug event
 - replace several DRM_* logging macros with drm_*
 - move hashtable to legacy code
 - add error return from gem_create_object
 - cma-helper: improve interfaces, drop CONFIG_DRM_KMS_CMA_HELPER
 - kernel.h related include cleanups
 - support XRGB2101010 source buffers
 
 ttm:
 - don't include drm hashtable
 - stop pruning fences after wait
 - documentation updates
 
 dma-buf:
 - add dma_resv selftest
 - add debugfs helpers
 - remove dma_resv_get_excl_unlocked
 - documentation
 - make fences mandatory in dma_resv_add_excl_fence
 
 dp:
 - add link training delay helpers
 
 gem:
 - link shmem/cma helpers into separate modules
 - use dma_resv iteratior
 - import dma-buf namespace into gem helper modules
 
 scheduler:
 - fence grab fix
 - lockdep fixes
 
 bridge:
 - switch to managed MIPI DSI helpers
 - register and attach during probe fixes
 - convert to YAML in several places.
 
 panel:
 - add bunch of new panesl
 
 simpledrm:
 - support FB_DAMAGE_CLIPS
 - support virtual screen sizes
 - add Apple M1 support
 
 amdgpu:
 - enable seamless boot for DCN 3.01
 - runtime PM fixes
 - use drm_kms_helper_connector_hotplug_event
 - get all fences at once
 - use generic drm fb helpers
 - PSR/DPCD/LTTPR/DSC/PM/RAS/OLED/SRIOV fixes
 - add smart trace buffer (STB) for supported GPUs
 - display debugfs entries
 - new SMU debug option
 - Documentation update
 
 amdkfd:
 - IP discovery enumeration refactor
 - interface between driver fixes
 - SVM fixes
 - kfd uapi header to define some sysfs bitfields.
 
 i915:
 - support VESA panel backlights
 - enable ADL-P by default
 - add eDP privacy screen support
 - add Raptor Lake S (RPL-S) support
 - DG2 page table support
 - lots of GuC/HuC fw refactoring
 - refactored i915->gt interfaces
 - CD clock squashing support
 - enable 10-bit gamma support
 - update ADL-P DMC fw to v2.14
 - enable runtime PM autosuspend by default
 - ADL-P DSI support
 - per-lane DP drive settings for ICL+
 - add support for pipe C/D DMC firmware
 - Atomic gamma LUT updates
 - remove CCS FB stride restrictions on ADL-P
 - VRR platform support for display 11
 - add support for display audio codec keepalive
 - lots of display refactoring
 - fix runtime PM handling during PXP suspend
 - improved eviction performance with async TTM moves
 - async VMA unbinding improvements
 - VMA locking refactoring
 - improved error capture robustness
 - use per device iommu checks
 - drop bits stealing from i915_sw_fence function ptr
 - remove dma_resv_prune
 - add IC cache invalidation on DG2
 
 nouveau:
 - crc fixes
 - validate LUTs in atomic check
 - set HDMI AVI RGB quant to full
 
 tegra:
 - buffer objects reworks for dma-buf compat
 - NVDEC driver uAPI support
 - power management improvements
 
 etnaviv:
 - IOMMU enabled system support
 - fix > 4GB command buffer mapping
 - close a DoS vector
 - fix spurious GPU resets
 
 ast:
 - fix i2c initialization
 
 rcar-du:
 - DSI output support
 
 exynos:
 - replace legacy gpio interface
 - implement generic GEM object mmap
 
 msm:
 - dpu plane state cleanup in prep for multirect
 - dpu debugfs cleanups
 - dp support for sc7280
 - a506 support
 - removal of struct_mutex
 - remove old eDP sub-driver
 
 anx7625:
 - support MIPI DSI input
 - support HDMI audio
 - fix reading EDID
 
 lvds:
 - fix bridge DT bindings
 
 megachips:
 - probe both bridges before registering
 
 dw-hdmi:
 - allow interlace on bridge
 
 ps8640:
 - enable runtime PM
 - support aux-bus
 
 tx358768:
 - enable reference clock
 - add pulse mode support
 
 ti-sn65dsi86:
 - use regmap bulk write
 - add PWM support
 
 etnaviv:
 - get all fences at once
 
 gma500:
 - gem object cleanups
 
 kmb:
 - enable fb console
 
 radeon:
 - use dma_resv_wait_timeout
 
 rockchip:
 - add DSP hold timeout
 - suspend/resume fixes
 - PLL clock fixes
 - implement mmap in GEM object functions
 - use generic fbdev emulation
 
 sun4i:
 - use CMA helpers without vmap support
 
 vc4:
 - fix HDMI-CEC hang with display is off
 - power on HDMI controller while disabling
 - support 4K@60Hz modes
 - support 10-bit YUV 4:2:0 output
 
 vmwgfx:
 - fix leak on probe errors
 - fail probing on broken hosts
 - new placement for MOB page tables
 - hide internal BOs from userspace
 - implement GEM support
 - implement GL 4.3 support
 
 virtio:
 - overflow fixes
 
 xen:
 - implement mmap as GEM object function
 
 omapdrm:
 - fix scatterlist export
 - support virtual planes
 
 mediatek:
 - MT8192 support
 - CMDQ refinement
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEEKbZHaGwW9KfbeusDHTzWXnEhr4FAmHX1vMACgkQDHTzWXnE
 hr6rAw/9ES5RO5N3Ku9foFk1CI9bqy1Kh663KLkkEc+rDdhKpiZbBnAsrKkZ9sGu
 fNuHmWNN5nWXtDSOqHWuslt3F7Gh+qEBQtlkqC9mZsBm3bWB0aJK6E4QaJxfeSaK
 ta6AmyGx8DaV+C69i86dnemQurYSDVjROd7LDPKnCU0Fye/JxiXSXQmXksKMFVxd
 x5vmO9yfeDSg3EF+u1yB6nJNUYZBV0vhrAfjPqxPCRBXuQc7akuaglE/SFwlGnEk
 vn0GjVHEQcRTqYKrHr64xvQxIoKXcJP0pkDUyT7KYCsyj8GJkvxkb7/ls5pp5DvL
 SwyNg3J3vwUVP6w6GEvzf3ffG720qqUZvCbvLmE+A/t2DhGILiAm+HXSo43PTOW8
 uagT7Gxma8dy8EovjSxioS9HPX8Gcu+S+XYavgOsevOZ7oeEt4f4TLW7LXsw9d6y
 75FrMhiUpreab5hAh8Le0swuLYZHjdnJRdjSTqZJ/T6VdTdVftLT6IfwvSDx5CHy
 cWuufgcAjd7xVTXFquHWYXWLTQkiSMGf1M02jx9IWolTd4Cm41LNBhqMEDHZLHJD
 7ngGgoaREVDQ+MqjG90yfIwJFIpJPI3YOaHLi/Kznga+iDzFY6cyOQWW2vX7ZdY5
 7+LJWsgGT8Feb7/bzD5hX1mYqJLxh1pWUIaqIKMl+7LJL7gTVU8=
 =MByd
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2022-01-07' of git://anongit.freedesktop.org/drm/drm

Pull drm updates from Dave Airlie:
 "Highlights are support for privacy screens found in new laptops, a
  bunch of nomodeset refactoring, and i915 enables ADL-P systems by
  default, while starting to add RPL-S support.

  vmwgfx adds GEM and support for OpenGL 4.3 features in userspace.

  Lots of internal refactorings around dma reservations, and lots of
  driver refactoring as well.

  Summary:

  core:
   - add privacy screen support
   - move nomodeset option into drm subsystem
   - clean up nomodeset handling in drivers
   - make drm_irq.c legacy
   - fix stack_depot name conflicts
   - remove DMA_BUF_SET_NAME ioctl restrictions
   - sysfs: send hotplug event
   - replace several DRM_* logging macros with drm_*
   - move hashtable to legacy code
   - add error return from gem_create_object
   - cma-helper: improve interfaces, drop CONFIG_DRM_KMS_CMA_HELPER
   - kernel.h related include cleanups
   - support XRGB2101010 source buffers

  ttm:
   - don't include drm hashtable
   - stop pruning fences after wait
   - documentation updates

  dma-buf:
   - add dma_resv selftest
   - add debugfs helpers
   - remove dma_resv_get_excl_unlocked
   - documentation
   - make fences mandatory in dma_resv_add_excl_fence

  dp:
   - add link training delay helpers

  gem:
   - link shmem/cma helpers into separate modules
   - use dma_resv iteratior
   - import dma-buf namespace into gem helper modules

  scheduler:
   - fence grab fix
   - lockdep fixes

  bridge:
   - switch to managed MIPI DSI helpers
   - register and attach during probe fixes
   - convert to YAML in several places.

  panel:
   - add bunch of new panesl

  simpledrm:
   - support FB_DAMAGE_CLIPS
   - support virtual screen sizes
   - add Apple M1 support

  amdgpu:
   - enable seamless boot for DCN 3.01
   - runtime PM fixes
   - use drm_kms_helper_connector_hotplug_event
   - get all fences at once
   - use generic drm fb helpers
   - PSR/DPCD/LTTPR/DSC/PM/RAS/OLED/SRIOV fixes
   - add smart trace buffer (STB) for supported GPUs
   - display debugfs entries
   - new SMU debug option
   - Documentation update

  amdkfd:
   - IP discovery enumeration refactor
   - interface between driver fixes
   - SVM fixes
   - kfd uapi header to define some sysfs bitfields.

  i915:
   - support VESA panel backlights
   - enable ADL-P by default
   - add eDP privacy screen support
   - add Raptor Lake S (RPL-S) support
   - DG2 page table support
   - lots of GuC/HuC fw refactoring
   - refactored i915->gt interfaces
   - CD clock squashing support
   - enable 10-bit gamma support
   - update ADL-P DMC fw to v2.14
   - enable runtime PM autosuspend by default
   - ADL-P DSI support
   - per-lane DP drive settings for ICL+
   - add support for pipe C/D DMC firmware
   - Atomic gamma LUT updates
   - remove CCS FB stride restrictions on ADL-P
   - VRR platform support for display 11
   - add support for display audio codec keepalive
   - lots of display refactoring
   - fix runtime PM handling during PXP suspend
   - improved eviction performance with async TTM moves
   - async VMA unbinding improvements
   - VMA locking refactoring
   - improved error capture robustness
   - use per device iommu checks
   - drop bits stealing from i915_sw_fence function ptr
   - remove dma_resv_prune
   - add IC cache invalidation on DG2

  nouveau:
   - crc fixes
   - validate LUTs in atomic check
   - set HDMI AVI RGB quant to full

  tegra:
   - buffer objects reworks for dma-buf compat
   - NVDEC driver uAPI support
   - power management improvements

  etnaviv:
   - IOMMU enabled system support
   - fix > 4GB command buffer mapping
   - close a DoS vector
   - fix spurious GPU resets

  ast:
   - fix i2c initialization

  rcar-du:
   - DSI output support

  exynos:
   - replace legacy gpio interface
   - implement generic GEM object mmap

  msm:
   - dpu plane state cleanup in prep for multirect
   - dpu debugfs cleanups
   - dp support for sc7280
   - a506 support
   - removal of struct_mutex
   - remove old eDP sub-driver

  anx7625:
   - support MIPI DSI input
   - support HDMI audio
   - fix reading EDID

  lvds:
   - fix bridge DT bindings

  megachips:
   - probe both bridges before registering

  dw-hdmi:
   - allow interlace on bridge

  ps8640:
   - enable runtime PM
   - support aux-bus

  tx358768:
   - enable reference clock
   - add pulse mode support

  ti-sn65dsi86:
   - use regmap bulk write
   - add PWM support

  etnaviv:
   - get all fences at once

  gma500:
   - gem object cleanups

  kmb:
   - enable fb console

  radeon:
   - use dma_resv_wait_timeout

  rockchip:
   - add DSP hold timeout
   - suspend/resume fixes
   - PLL clock fixes
   - implement mmap in GEM object functions
   - use generic fbdev emulation

  sun4i:
   - use CMA helpers without vmap support

  vc4:
   - fix HDMI-CEC hang with display is off
   - power on HDMI controller while disabling
   - support 4K@60Hz modes
   - support 10-bit YUV 4:2:0 output

  vmwgfx:
   - fix leak on probe errors
   - fail probing on broken hosts
   - new placement for MOB page tables
   - hide internal BOs from userspace
   - implement GEM support
   - implement GL 4.3 support

  virtio:
   - overflow fixes

  xen:
   - implement mmap as GEM object function

  omapdrm:
   - fix scatterlist export
   - support virtual planes

  mediatek:
   - MT8192 support
   - CMDQ refinement"

* tag 'drm-next-2022-01-07' of git://anongit.freedesktop.org/drm/drm: (1241 commits)
  drm/amdgpu: no DC support for headless chips
  drm/amd/display: fix dereference before NULL check
  drm/amdgpu: always reset the asic in suspend (v2)
  drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform
  drm/amd/display: Fix the uninitialized variable in enable_stream_features()
  drm/amdgpu: fix runpm documentation
  amdgpu/pm: Make sysfs pm attributes as read-only for VFs
  drm/amdgpu: save error count in RAS poison handler
  drm/amdgpu: drop redundant semicolon
  drm/amd/display: get and restore link res map
  drm/amd/display: support dynamic HPO DP link encoder allocation
  drm/amd/display: access hpo dp link encoder only through link resource
  drm/amd/display: populate link res in both detection and validation
  drm/amd/display: define link res and make it accessible to all link interfaces
  drm/amd/display: 3.2.167
  drm/amd/display: [FW Promotion] Release 0.0.98
  drm/amd/display: Undo ODM combine
  drm/amd/display: Add reg defs for DCN303
  drm/amd/display: Changed pipe split policy to allow for multi-display pipe split
  drm/amd/display: Set optimize_pwr_state for DCN31
  ...
2022-01-10 12:58:46 -08:00
Yazen Ghannam
91f75eb481 x86/MCE/AMD, EDAC/mce_amd: Support non-uniform MCA bank type enumeration
AMD systems currently lay out MCA bank types such that the type of bank
number "i" is either the same across all CPUs or is Reserved/Read-as-Zero.

For example:

  Bank # | CPUx | CPUy
    0      LS     LS
    1      RAZ    UMC
    2      CS     CS
    3      SMU    RAZ

Future AMD systems will lay out MCA bank types such that the type of
bank number "i" may be different across CPUs.

For example:

  Bank # | CPUx | CPUy
    0      LS     LS
    1      RAZ    UMC
    2      CS     NBIO
    3      SMU    RAZ

Change the structures that cache MCA bank types to be per-CPU and update
smca_get_bank_type() to handle this change.

Move some SMCA-specific structures to amd.c from mce.h, since they no
longer need to be global.

Break out the "count" for bank types from struct smca_hwid, since this
should provide a per-CPU count rather than a system-wide count.

Apply the "const" qualifier to the struct smca_hwid_mcatypes array. The
values in this array should not change at runtime.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211216162905.4132657-3-yazen.ghannam@amd.com
2021-12-22 17:22:09 +01:00
Isabella Basso
929bb8e200 drm/amdgpu: fix amdgpu_ras_mca_query_error_status scope
This commit fixes the compile-time warning below:

 warning: no previous prototype for ‘amdgpu_ras_mca_query_error_status’
 [-Wmissing-prototypes]

Changes since v1:
- As suggested by Alexander Deucher:
  1. Make function static instead of adding prototype.

Signed-off-by: Isabella Basso <isabbasso@riseup.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:33:17 -05:00
Isabella Basso
bbe04dec5c drm/amd: fix improper docstring syntax
This fixes various warnings relating to erroneous docstring syntax, of
which some are listed below:

 warning: Function parameter or member 'adev' not described in
 'amdgpu_atomfirmware_ras_rom_addr'
 ...
 warning: expecting prototype for amdgpu_atpx_validate_functions().
 Prototype was for amdgpu_atpx_validate() instead
 ...
 warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new'
 ...
 warning: Cannot understand  * @kfd_get_cu_occupancy - Collect number of
 waves in-flight on this device
 ...
 warning: This comment starts with '/**', but isn't a kernel-doc
 comment. Refer Documentation/doc-guide/kernel-doc.rst

Signed-off-by: Isabella Basso <isabbasso@riseup.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:34 -05:00
Tao Zhou
655ff3538e drm/amdgpu: enable RAS poison flag when GPU is connected to CPU
The RAS poison mode is enabled by default on the platform.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-13 16:32:34 -05:00
Stanley.Yang
cf63b70272 drm/amdgpu: skip umc ras error count harvest
remove in recovery stat check, skip umc ras err cnt
harvest in amdgpu_ras_log_on_err_counter

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-07 13:12:19 -05:00
Stanley.Yang
aed1faab9d drm/amdgpu: only skip get ecc info for aldebaran
skip get ecc info for aldebarn through check ip version
do not affect other asic type

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-07 13:07:55 -05:00
Stanley.Yang
bab73f092d drm/amdgpu: skip query ecc info in gpu recovery
this is a workaround due to get ecc info failed during gpu recovery

[  700.236122] amdgpu 0000:09:00.0: amdgpu: Failed to export SMU ecc table!
[  700.236128] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
[  704.331171] amdgpu: qcm fence wait loop timeout expired
[  704.331194] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[  704.332445] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
[  704.332448] amdgpu 0000:09:00.0: amdgpu: Bailing on TDR for s_job:ffffffffffffffff, as another already in progress
[  704.332456] amdgpu: Pasid 0x8000 destroy queue 0 failed, ret -62
[  710.360924] amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000013 SMN_C2PMSG_82:0x00000007
[  710.360964] amdgpu 0000:09:00.0: amdgpu: Failed to disable smu features.
[  710.361002] amdgpu 0000:09:00.0: amdgpu: Fail to disable dpm features!
[  710.361014] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-02 12:43:06 -05:00
Stanley.Yang
232d1d43b5 drm/amdgpu: fix disable ras feature failed when unload drvier v2
v2:
    still need call ras_disable_all_featrures to handle
    ras initilization failure case.

Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
so ras ta will unload before send ras disable command, ras dsiable operation
must before hw fini.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:03:22 -05:00
Stanley.Yang
fdcb279d5b drm/amdgpu: query umc error info from ecc_table v2
if smu support ECCTABLE, driver can message smu to get ecc_table
then query umc error info from ECCTABLE

v2:
    optimize source code makes logical more reasonable

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:45:46 -05:00
Candice Li
d9a69fe512 drm/amdgpu: Add recovery_lock to save bad pages function
Fix race condition failure during UMC UE injection.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-11-22 14:45:02 -05:00
Mukul Joshi
91a1a52d03 drm/amdgpu: Fix RAS page retirement with mode2 reset on Aldebaran
During mode2 reset, the GPU is temporarily removed from the
mgpu_info list. As a result, page retirement fails because it
cannot find the GPU in the GPU list.
To fix this, create our own list of GPUs that support MCE notifier
based page retirement and use that list to check if the UMC error
occurred on a GPU that supports MCE notifier based page retirement.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-13 14:14:48 -04:00
Mukul Joshi
12b2cab790 drm/amdgpu: Register MCE notifier for Aldebaran RAS
On Aldebaran, GPU driver will handle bad page retirement
for GPU memory even though UMC is host managed. As a result,
register a bad page retirement handler on the mce notifier
chain to retire bad pages on Aldebaran.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-06 15:53:38 -04:00
John Clements
eb601e61d3 drm/amdgpu: resolve RAS query bug
clear error count when persistant harvesting is not enabled

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-10-04 15:22:57 -04:00
Tao Zhou
f524dd54a7 drm/amdgpu: skip umc ras irq handling in poison mode (v2)
In ras poison mode, umc uncorrectable error will be ignored until
the corrupted data consumed by another ras module (such as gfx, sdma).

v2: update the debug message and replace dev_warn with dev_info.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-28 09:30:07 -04:00
Tao Zhou
e43488493c drm/amdgpu: set poison supported flag for RAS (v2)
Add RAS poison supported flag and tell PSP RAS TA about the info.

v2: rename poison mode to poison supported, we can also disable poison
mode even we support it.
    print value of poison supported if ras feature enablement fails.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-28 09:30:07 -04:00
Candice Li
9080a18fc5 drm/amdgpu: Remove all code paths under the EAGAIN path in RAS late init
All code paths under the EAGAIN path in RAS late init are unused.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-23 16:35:13 -04:00
John Clements
640ae42efb drm/amdgpu: Updated RAS infrastructure
Update RAS infrastructure to support RAS query for MCA subblocks

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-23 16:34:43 -04:00
John Clements
3771449bc8 drm/amdgpu: Update RAS trigger error block support
Added trigger error support for MP0/MP1/MPIO blocks

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-14 15:37:48 -04:00
Candice Li
a0a2f7bb22 drm/amd/amdgpu: add mpio to ras block
Add MPIO to RAS block

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-09-01 16:55:11 -04:00
Candice Li
893cf382c0 drm/amd/amdgpu: remove unnecessary RAS context field
Delete ras_if->name in the RAS ctx structure and remove related lines.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-16 15:35:55 -04:00
Candice Li
6457205c07 drm/amd/amdgpu: consolidate PSP TA context
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-16 15:18:04 -04:00
Luben Tuikov
4d9f771e11 drm/amdgpu: Return error if no RAS
In amdgpu_ras_query_error_count() return an error
if the device doesn't support RAS. This prevents
that function from having to always set the values
of the integer pointers (if set), and thus
prevents function side effects--always to have to
set values of integers if integer pointers set,
regardless of whether RAS is supported or
not--with this change this side effect is
mitigated.

Also, if no pointers are set, don't count, since
we've no way of reporting the counts.

Also, give this function a kernel-doc.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Reported-by: Tom Rix <trix@redhat.com>
Fixes: a46751fbcd ("drm/amdgpu: Fix RAS function interface")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-08 15:13:07 -04:00
Luben Tuikov
1d9d2ca85b drm/amdgpu: Fix koops when accessing RAS EEPROM
Debugfs RAS EEPROM files are available when
the ASIC supports RAS, and when the debugfs is
enabled, an also when "ras_enable" module
parameter is set to 0. However in this case,
we get a kernel oops when accessing some of
the "ras_..." controls in debugfs. The reason
for this is that struct amdgpu_ras::adev is
unset. This commit sets it, thus enabling access
to those facilities. Note that this facilitates
EEPROM access and not necessarily RAS features or
functionality.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:25:33 -04:00
Luben Tuikov
c65b0805e7 drm/amdgpu: RAS EEPROM table is now in debugfs
Add "ras_eeprom_size" file in debugfs, which
reports the maximum size allocated to the RAS
table in EEROM, as the number of bytes and the
number of records it could store. For instance,

$cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size
262144 bytes or 10921 records
$_

Add "ras_eeprom_table" file in debugfs, which
dumps the RAS table stored EEPROM, in a formatted
way. For instance,

$cat ras_eeprom_table
 Signature    Version  FirstOffs       Size   Checksum
0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA
Index  Offset ErrType Bank/CU          TimeStamp      Offs/Addr MemChl MCUMCID    RetiredPage
    0 0x00014      ue    0x00 0x00000000607608DC 0x000000000000   0x00    0x00 0x000000000000
    1 0x0002C      ue    0x00 0x00000000607608DC 0x000000001000   0x00    0x00 0x000000000001
    2 0x00044      ue    0x00 0x00000000607608DC 0x000000002000   0x00    0x00 0x000000000002
    3 0x0005C      ue    0x00 0x00000000607608DC 0x000000003000   0x00    0x00 0x000000000003
    4 0x00074      ue    0x00 0x00000000607608DC 0x000000004000   0x00    0x00 0x000000000004
    5 0x0008C      ue    0x00 0x00000000607608DC 0x000000005000   0x00    0x00 0x000000000005
    6 0x000A4      ue    0x00 0x00000000607608DC 0x000000006000   0x00    0x00 0x000000000006
    7 0x000BC      ue    0x00 0x00000000607608DC 0x000000007000   0x00    0x00 0x000000000007
    8 0x000D4      ue    0x00 0x00000000607608DD 0x000000008000   0x00    0x00 0x000000000008
$_

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Xinhui Pan <xinhui.pan@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:24:41 -04:00
Luben Tuikov
63d4c081a5 drm/amdgpu: Optimize EEPROM RAS table I/O
Split functionality between read and write, which
simplifies the code and exposes areas of
optimization and more or less complexity, and take
advantage of that.

Read and write the table in one go; use a separate
stage to decode or encode the data, as opposed to
on the fly, which keeps the I2C bus busy. Use a
single read/write to read/write the table or at
most two if the number of records we're
reading/writing wraps around.

Check the check-sum of a table in EEPROM on init.

Update the checksum at the same time as when
updating the table header signature, when the
threshold was increased on boot.

Take advantage of arithmetic modulo 256, that is,
use a byte!, to greatly simplify checksum
arithmetic.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:24:41 -04:00
Luben Tuikov
0686627b3f drm/amdgpu: Some renames
Qualify with "ras_". Use kernel's own--don't
redefine your own.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:24:41 -04:00
Luben Tuikov
e4e6a58935 drm/amdgpu: Use explicit cardinality for clarity
RAS_MAX_RECORD_NUM may mean the maximum record
number, as in the maximum house number on your
street, or it may mean the maximum number of
records, as in the count of records, which is also
a number. To make this distinction whether the
number is ordinal (index) or cardinal (count),
rename this macro to RAS_MAX_RECORD_COUNT.

This makes it easy to understand what it refers
to, especially when we compute quantities such as,
how many records do we have left in the table,
especially when there are so many other numbers,
quantities and numerical macros around.

Also rename the long,
amdgpu_ras_eeprom_get_record_max_length() to the
more succinct and clear,
amdgpu_ras_eeprom_max_record_count().

When computing the threshold, which also deals
with counts, i.e. "how many", use cardinal
"max_eeprom_records_count", than the quantitative
"max_eeprom_records_len".

Simplify the logic here and there, as well.

Cc: Guchun Chen <guchun.chen@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:24:41 -04:00
Luben Tuikov
cf696091d3 drm/amdgpu: Return result fix in RAS
The low level EEPROM write method, doesn't return
1, but the number of bytes written. Thus do not
compare to 1, instead, compare to greater than 0
for success.

Other cleanup: if the lower layers returned
-errno, then return that, as opposed to
overwriting the error code with one-fits-all
-EINVAL. For instance, some return -EAGAIN.

Cc: Jean Delvare <jdelvare@suse.de>
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: Lijo Lazar <Lijo.Lazar@amd.com>
Cc: Stanley Yang <Stanley.Yang@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:24:41 -04:00
Luben Tuikov
1fab841ff6 drm/amdgpu: RAS xfer to read/write
Wrap amdgpu_ras_eeprom_xfer(..., bool write),
into amdgpu_ras_eeprom_read() and
amdgpu_ras_eeprom_write(), as that makes reading
and understanding the code clearer.

Cc: Jean Delvare <jdelvare@suse.de>
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: Lijo Lazar <Lijo.Lazar@amd.com>
Cc: Stanley Yang <Stanley.Yang@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:24:40 -04:00
Luben Tuikov
a43996573a drm/amdgpu: Rename misspelled function
Instead of fixing the spelling in
  amdgpu_ras_eeprom_process_recods(),
rename it to,
  amdgpu_ras_eeprom_xfer(),
to look similar to other I2C and protocol
transfer (read/write) functions.

Also to keep the column span to within reason by
using a shorter name.

Change the "num" function parameter from "int" to
"const u32" since it is the number of items
(records) to xfer, i.e. their count, which cannot
be a negative number.

Also swap the order of parameters, keeping the
pointer to records and their number next to each
other, while the direction now becomes the last
parameter.

Cc: Jean Delvare <jdelvare@suse.de>
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: Lijo Lazar <Lijo.Lazar@amd.com>
Cc: Stanley Yang <Stanley.Yang@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-01 00:24:40 -04:00
Stanley.Yang
513befa634 drm/amdgpu: message smu to update hbm bad page number
Use SMU to update the bad pages rather than directly
accessing the EEPROM from the driver.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-18 17:11:56 -04:00
Stanley.Yang
e11d5e0d68 drm/amdgpu: add vega20 to ras quirk list
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-18 17:01:35 -04:00
Guchun Chen
a3fbb0d810 drm/amdgpu: use adev_to_drm macro for consistency (v2)
Use adev_to_drm() to get to the drm_device pointer.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-06-11 16:03:32 -04:00
Dave Airlie
5745d647d5 Merge tag 'amd-drm-next-5.14-2021-06-02' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.14-2021-06-02:

amdgpu:
- GC/MM register access macro clean up for SR-IOV
- Beige Goby updates
- W=1 Fixes
- Aldebaran fixes
- Misc display fixes
- ACPI ATCS/ATIF handling rework
- SR-IOV fixes
- RAS fixes
- 16bpc fixed point format support
- Initial smartshift support
- RV/PCO power tuning fixes for suspend/resume
- More buffer object subclassing work
- Add new INFO query for additional vbios information
- Add new placement for preemptable SG buffers

amdkfd:
- Misc fixes

radeon:
- W=1 Fixes
- Misc cleanups

UAPI:
- Add new INFO query for additional vbios information
  Useful for debugging vbios related issues.  Proposed umr patch:
  https://patchwork.freedesktop.org/patch/433297/
- 16bpc fixed point format support
  IGT test:
  https://lists.freedesktop.org/archives/igt-dev/2021-May/031507.html
  Proposed Vulkan patch:
  a25d480207
- Add a new GEM flag which is only used internally in the kernel driver.  Userspace
  is not allowed to set it.

drm:
- 16bpc fixed point format fourcc

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210602214009.4553-1-alexander.deucher@amd.com
2021-06-04 06:13:57 +10:00
Luben Tuikov
05adfd80cc drm/amdgpu: Use delayed work to collect RAS error counters
On Context Query2 IOCTL return the correctable and
uncorrectable errors in O(1) fashion, from cached
values, and schedule a delayed work function to
calculate and cache them for the next such IOCTL.

v2: Cancel pending delayed work at ras_fini().
v3: Remove conditionals when dealing with delayed
    work manipulation as they're inherently racy.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-27 12:23:06 -04:00
Luben Tuikov
a46751fbcd drm/amdgpu: Fix RAS function interface
The correctable and uncorrectable errors
are calculated at each invocation of this
function. Therefore, it is highly inefficient to
return just one of them based on a Boolean
input. If the caller wants both, twice the work
would be done. (And this work is O(n^3) on
Vega20.)

Fix this "interface" to simply return what it had
calculated--both values. Let the caller choose
what it wants to record, inspect, use.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-27 12:22:54 -04:00
Thomas Zimmermann
304ba5dca4 Merge drm/drm-next into drm-misc-next
Backmerging from drm/drm-next to the patches for AMD devices
for v5.14.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2021-05-22 07:17:05 +02:00
Andrey Grodzovsky
72c8c97b15 drm/amdgpu: Split amdgpu_device_fini into early and late
Some of the stuff in amdgpu_device_fini such as HW interrupts
disable and pending fences finilization must be done right away on
pci_remove while most of the stuff which relates to finilizing and
releasing driver data structures can be kept until
drm_driver.release hook is called, i.e. when the last device
reference is dropped.

v4: Change functions prefix early->hw and late->sw

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-3-andrey.grodzovsky@amd.com
2021-05-19 23:45:49 -04:00
John Clements
8f6368a9c9 drm/amdgpu: Conditionally reset RAS counters on boot
Only clear RAS error counters if perestent EDC harvesting is not supported

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-19 22:38:11 -04:00
Dwaipayan Ray
c666bbf0e9 drm/amd/amdgpu: Fix errors in function documentation
Fix a couple of syntax errors and removed one excess
parameter in the function documentations which lead
to kernel docs build warning.

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dwaipayan Ray <dwaipayanray1@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:14:49 -04:00
Dennis Li
7780f50358 drm/amdgpu: add function to clear MMEA error status for aldebaran
For aldebaran, hardware will not clear error status automatically when
reading error status register, insteadly driver should set clear bit of
the error status register explicitly to clear error status.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:11:44 -04:00
Dennis Li
011907fda3 drm/amdgpu: covert ras status to kernel errno
The original codes use ras status and kernl errno together in the same
function, which is a wrong code style.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:09:37 -04:00
Oak Zeng
7ddd977085 drm/amdgpu: Quit RAS initialization earlier if RAS is disabled
If RAS is disabled through amdgpu_ras_enable kernel parameter,
we should quit the RAS initialization eariler to avoid initialization
of some RAS data structure such as sysfs etc.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:08:48 -04:00
Luben Tuikov
ef0d7d2001 drm/amdgpu: Export ras_*_enabled to debugfs
Export the runtime-set "ras_hw_enabled" and
"ras_enabled" to debugfs, for debugging.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:08:25 -04:00
Luben Tuikov
8ab0d6f030 drm/amdgpu: Rename to ras_*_enabled
Rename,
  ras_hw_supported --> ras_hw_enabled, and
  ras_features     --> ras_enabled,
to show that ras_enabled is a subset of
ras_hw_enabled, which itself is a subset
of the ASIC capability.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:08:12 -04:00
Luben Tuikov
e509965e58 drm/amdgpu: Move up ras_hw_supported
Move ras_hw_supported into struct amdgpu_dev.
The dependency is:
struct amdgpu_ras <== struct amdgpu_dev <== ASIC,
read as "struct amdgpu_ras depends on struct
amdgpu_dev, which depends on the hardware."

This can be loosely understood as, "if RAS is
supported, which is property of the ASIC (struct
amdgpu_dev), then we can access struct
amdgpu_ras."

v2: Fix a typo: must binary AND in ternary cond
    in amdgpu_ras.c

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:08:06 -04:00
Luben Tuikov
acdae2169b drm/amdgpu: Remove redundant ras->supported
Remove redundant ras->supported, as this value
is also stored in adev->ras_features.

Use adev->ras_features, as that supercedes "ras",
since the latter is its member.

The dependency goes like this:
ras <== adev->ras_features <== hw_supported,
and is read as "ras depends on ras_features, which
depends on hw_supported." The arrows show the flow
of information, i.e. the dependency update.

"hw_supported" should also live in "adev".

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:07:56 -04:00
Stanley.Yang
f50160cf0f drm/amdgpu: force enable gfx ras for vega20 ws
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:44 -04:00
Hawking Zhang
1f6e8eb153 drm/amdgpu: enable gfx ras in aldebran by default
gfx ras now can be enabled by default in aldebaran

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:43 -04:00
Hawking Zhang
78871b6c8b drm/amdgpu: enable ras error count query and reset for HDP
add hdp block ras error query and reset support in
amdgpu ras error count query and reset interface

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-05-10 18:06:43 -04:00
Hawking Zhang
a30f128602 drm/amdgpu: provide socket/die id info in RAS message
Add socket/die information in RAS messages for platforms
that support query those information

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-28 23:35:50 -04:00
Hawking Zhang
502f0e2804 drm/amdgpu: disable gfx ras by default in aldebaran
aldebaran gfx ras is still under development

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23 17:16:14 -04:00
Stanley.Yang
19d0dfda4c drm/amdgpu: optimize gfx ras features flag clean
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-23 17:15:52 -04:00
Mukul Joshi
1f0d8e3781 drm/amdgpu: Reset RAS error count and status regs
Reset the RAS error count and error status registers after
reading to prevent over reporting error counts on Aldebaran.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-By: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:45:10 -04:00
Dennis Li
6df23f4c5c drm/amdgpu: fix a error injection failed issue
because "sscanf(str, "retire_page")" always return 0, if application use
the raw data for error injection, it always wrongly falls into "op ==
3". Change to use strstr instead.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-20 21:36:01 -04:00
Luben Tuikov
546aa546b0 drm/amdgpu: Add double-sscanf but invert
Add back the double-sscanf so that both decimal
and hexadecimal values could be read in, but this
time invert the scan so that hexadecimal format
with a leading 0x is tried first, and if that
fails, then try decimal format.

Also use a logical-AND instead of nesting double
if-conditional.

See commit "drm/amdgpu: Fix a bug for input with double sscanf"

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Luben Tuikov
737c375b88 drm/amdgpu: Fix kernel-doc for the RAS sysfs interface
Imporve the kernel-doc for the RAS sysfs
interface. Fix the grammar, fix the context.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Luben Tuikov
7fb6407145 drm/amdgpu: Add bad_page_cnt_threshold to debugfs
Add bad_page_cnt_threshold to debugfs, an optional
file system used for debugging, for reporting
purposes only--it usually matches the size of
EEPROM but may be different depending on the
"bad_page_threshold" kernel module option.

The "bad_page_cnt_threshold" is a dynamically
computed value. It depends on three things: the
VRAM size; the size of the EEPROM (or the size
allocated to the RAS table therein); and the
"bad_page_threshold" module parameter. It is a
dynamically computed value, when the amdgpu module
is run, on which further parameters and logic
depend, and as such it is helpful to see the
dynamically computed value in debugfs.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Luben Tuikov
80b0cd0fb9 drm/amdgpu: Fix a bug in checking the result of reserve page
Fix if (ret) --> if (!ret), a bug, for
"retire_page", which caused the kernel to recall
the method with *pos == end of file, and that
bounced back with error. On the first run, we
advanced *pos, but returned 0 back to fs layer,
also a bug.

Fix the logic of the check of the result of
amdgpu_reserve_page_direct()--it is 0 on success,
and non-zero on error, not the other way
around. This patch fixes this bug.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
Luben Tuikov
6cb7a1d40a drm/amdgpu: Fix a bug for input with double sscanf
Remove double-sscanf to scan for %llu and 0x%llx,
as that is not going to work!

The %llu will consume the "0" in "0x" of your
input, and the hex value you think you're entering
will always be 0. That is, a valid hex value can
never be consumed.

On the other hand, just entering a hex number
without leading 0x will either be scanned as a
string and not match, for instance FAB123, or
the leading decimal portion is scanned as the
%llu, for instance 123FAB will be scanned as 123,
which is not correct.

Thus remove the first %llu scan and leave only the
%llx scan, removing the leading 0x since %llx can
scan either.

Addresses are usually always hex values, so this
suffices.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Xinhui Pan <xinhui.pan@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-15 16:32:44 -04:00
John Clements
cbb8f989d5 drm/amdgpu: page retire over debugfs mechanism
added support in RAS debugfs to add bad page for isolated page retirement testing

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:58:28 -04:00
John Clements
134d16d50f drm/amdgpu: RAS harvest on driver load
In event of RAS UE + warm reset, error counters shall be harvested and cleared on driver load

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:54:51 -04:00
Hawking Zhang
719a9b3323 drm/amdgpu: split gfx callbacks into ras and non-ras ones
gfx ras is only available in cerntain ip generations.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:51:22 -04:00
Hawking Zhang
8bc7b360ad drm/amdgpu: split mmhub callbacks into ras and non-ras ones
mmhub ras is only avaiable in cerntain mmhub ip
generation.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:51:19 -04:00
Hawking Zhang
49070c4ea3 drm/amdgpu: split umc callbacks to ras and non-ras ones
umc ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. split umc callbacks
into ras and non-ras ones so gpu driver only
initializes umc ras callbacks when it manages
umc ras.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:51:11 -04:00
Hawking Zhang
52137ca852 drm/amdgpu: move xgmi ras functions to xgmi_ras_funcs
xgmi ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. move all xgmi ras
functions to xgmi_ras_funcs so gpu driver only
initializes xgmi ras functions when it manages
xgmi ras.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:51:07 -04:00
Hawking Zhang
6e36f23193 drm/amdgpu: split nbio callbacks into ras and non-ras ones
nbio ras is not managed by gpu driver when gpu is
connected to cpu through xgmi. split nbio callbacks
into ras and non-ras ones so gpu driver only
initializes nbio ras callbacks when it manages
nbio ras.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:51:04 -04:00
Hawking Zhang
75f06251c9 drm/amdgpu: initialze ras caps per paltform config
Driver only manages GFX/SDMA/MMHUB RAS in platforms
that gpu node is connected to cpu through XGMI, other
than that, it queries VBIOS for RAS capabilities.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:50:48 -04:00
Bernard Zhao
f08726868c drm/amd: cleanup coding style a bit
Fix patch check warning:
WARNING: suspect code indent for conditional statements (8, 17)
+       if (obj && obj->use < 0) {
+                DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name);

WARNING: braces {} are not necessary for single statement blocks
+       if (obj && obj->use < 0) {
+                DRM_ERROR("RAS ERROR: Unbalance obj(%s) use\n", obj->head.name);
+       }

Signed-off-by: Bernard Zhao <bernard@vivo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:50:31 -04:00
Stanley.Yang
5a43452704 drm/amdgpu: support sdma error injection
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reivewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:50:23 -04:00
Tian Tao
36000c7a51 drm/amdgpu: Convert sysfs sprintf/snprintf family to sysfs_emit
Fix the following coccicheck warning:
drivers/gpu//drm/amd/amdgpu/amdgpu_ras.c:434:9-17: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:220:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_xgmi.c:249:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/df_v3_6.c:208:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_psp.c:2973:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:75:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:112:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:58:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:93:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_vram_mgr.c:125:9-17: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:52:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_gtt_mgr.c:71:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:140:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:164:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:186:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_device.c:208:8-16: WARNING:
use scnprintf or sprintf
drivers/gpu//drm/amd/amdgpu/amdgpu_atombios.c:1916:8-16: WARNING:
use scnprintf or sprintf

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:43:47 -04:00
Luben Tuikov
084e2640e5 drm/amdgpu: Fix check for RAS support
Use positive logic to check for RAS
support. Rename the function to actually indicate
what it is testing for. Essentially, make the
function a predicate with the correct name.

Cc: Stanley Yang <Stanley.Yang@amd.com>
Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-04-09 16:43:18 -04:00
Christian König
e5c04edfcd drm/amdgpu: revert "reserve backup pages for bad page retirment"
As noted during the review this approach doesn't make sense at all.

We should not apply any limitation on the VRAM applications can use inside the kernel.

If an application or end user wants to reserve a certain amount of VRAM for bad pages handling we should do this in the upper layer.

This reverts commit f89b881c81.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 23:40:06 -04:00
Stanley.Yang
970fd19764 drm/amdgpu: fix send ras disable cmd when asic not support ras
cause:
	It is necessary to send ras disable command to ras-ta during gfx
	block ras later init, because the ras capability is disable read
	from vbios for vega20 gaming, but the ras context is released
	during ras init process, this will cause send ras disable command
	to ras-to failed.
    how:
	Delay releasing ras context, the ras context
	will be released after gfx block later init done.

Changed from V1:
    move release_ras_context into ras_resume

Changed from V2:
    check BIT(UMC) is more reasonable before access eeprom table

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 23:30:12 -04:00
Hawking Zhang
b69d5c7e95 drm/amdgpu: support query ecc cap for SIENNA_CICHLID
driver needs to query umc_info_v3_3 for ecc capability
in sienna_cichlid

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 23:29:34 -04:00
shaoyunl
e5086659d0 drm/amdgpu: skip read eeprom for device that pending on XGMI reset
Read eeprom through SMU doesn't works stable on XGMI reset during test.
skip it for now

Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 23:25:39 -04:00
Dennis Li
761d86d37f drm/amdgpu: harvest edc status when connected to host via xGMI
When connected to a host via xGMI, system fatal errors may trigger
warm reset, driver has no change to query edc status before reset.
Therefore in this case, driver should harvest previous error loging
registers during boot, instead of only resetting them.

v2:
1. IP's ras_manager object is created when its ras feature is enabled,
so change to query edc status after amdgpu_ras_late_init called

2. change to enable watchdog timer after finishing gfx edc init

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reivewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 23:00:41 -04:00
Dennis Li
88f8575bca drm/amdgpu: enable watchdog feature for SQ of aldebaran
SQ's watchdog timer monitors forward progress, a mask of which waves
caused the watchdog timeout is recorded into ras status registers and
then trigger a system fatal error event.

v2:
1. change *query_timeout_status to *query_sq_timeout_status.
2. move query_sq_timeout_status into amdgpu_ras_do_recovery.
3. add module parameters to enable/disable fatal error event and modify
the watchdog timer.

v3:
1. remove unused parameters of *enable_watchdog_timer

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-03-23 22:59:52 -04:00
Dennis Li
11003c68b1 drm/amdgpu: remove unnecessary reading for epprom header
If the number of badpage records exceed the threshold, driver has
updated both epprom header and control->tbl_hdr.header before gpu reset,
therefore GPU recovery thread no need to read epprom header directly.

v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-26 17:23:49 -05:00
Dennis Li
f89b881c81 drm/amdgpu: reserve backup pages for bad page retirment
To ensure user has a constant of VRAM accessible in run-time, driver
reserves limit backup pages when init, and return ones when bad pages
retired, to keep no change of unused memory size.

v2: refine codes to calculate badpags threshold

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-26 17:00:56 -05:00
Nirmoy Das
ea1b8c9b83 drm/amdgpu: mark local function as static
Mark amdgpu_ras_debugfs_create_ctrl_node() as static.

Fixes: eb14235668777b ("drm/amdgpu: do not keep debugfs dentry")
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-18 16:43:11 -05:00
Nirmoy Das
88293c03c8 drm/amdgpu: do not keep debugfs dentry
Cleanup unnecessary debugfs dentries and surrounding functions.

v3: remove return value check for debugfs_create_file()
v2: remove ttm_debugfs_entries array.
    do not init variables.

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-02-18 16:43:09 -05:00
Guchun Chen
fe2d9f5abf drm/amdgpu: toggle on DF Cstate after finishing xgmi injection
Fixes: 5c23e9e05e ("drm/amdgpu: Update RAS XGMI error inject sequence")
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-01-15 15:20:00 -05:00
Dennis Li
732f2a307c drm/amdgpu: fix no bad_pages issue after umc ue injection
old code wrongly used the bad page status as the function return value,
which cause amdgpu_ras_badpages_read always return failed.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-01-05 11:35:33 -05:00
Arnd Bergmann
cedf788459 drm/amdgpu: fix debugfs creation/removal, again
There is still a warning when CONFIG_DEBUG_FS is disabled:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function]
 1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)

Change the code again to make the compiler actually drop
this code but not warn about it.

Fixes: ae2bf61ff3 ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS")
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-12-08 23:02:05 -05:00
Lee Jones
cd92df9350 drm/amd/amdgpu/amdgpu_ras: Make local function 'amdgpu_ras_error_status_query' static
Fixes the following W=1 kernel build warning(s):

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1482:6: warning: no previous prototype for ‘amdgpu_ras_error_status_query’ [-Wmissing-prototypes]

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-11-13 17:29:47 -05:00
Lee Jones
32dc53480a drm/amd/amdgpu/amdgpu_ras: Remove unused function 'amdgpu_ras_error_cure'
Fixes the following W=1 kernel build warning(s):

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:908:5: warning: no previous prototype for ‘amdgpu_ras_error_cure’ [-Wmissing-prototypes]

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Lee Jones <lee.jones@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-11-13 17:29:47 -05:00
Deepak R Varma
f3729f7b1a drm/amdgpu/amdgpu: improve code indentation and alignment
General code indentation and alignment changes such as replace spaces
by tabs or align function arguments as per the coding style
guidelines. The patch corrects issues for various amdgpu_*.c files
for this driver. Issue reported by checkpatch script.

Signed-off-by: Deepak R Varma <mh12gx2825@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-11-02 15:34:29 -05:00
Tom Rix
aec576f9d2 drm/amdgpu: remove unneeded semicolon
A semicolon is not needed after a switch statement.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30 01:02:26 -04:00
Dennis Li
676deb3877 drm/amdgpu: fix the issue of reserving bad pages failed
In amdgpu_ras_reset_gpu, because bad pages may not be freed,
it has high probability to reserve bad pages failed.

Change to reserve bad pages when freeing VRAM.

v2:
1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c
2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr
   reserve bad page failed, it will put it into pending list, otherwise
   put it into processed list;
3. remove amdgpu_ras_release_bad_pages, because retired page's info has
   been moved into amdgpu_vram_mgr

v3:
1. formate code style;
2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation;
3. rename scope_pending as reservations_pending;
4. rename scope_processed as reserved_pages;
5. change to iterate over all the pending ones and try to insert them
   with drm_mm_reserve_node();

v4:
1. rename amdgpu_vram_mgr_reserve_scope as
amdgpu_vram_mgr_reserve_range;
2. remove unused include "amdgpu_ras.h";
3. rename amdgpu_vram_mgr_check_and_reserve as
amdgpu_vram_mgr_do_reserve;
4. refine amdgpu_vram_mgr_reserve_range to call
amdgpu_vram_mgr_do_reserve.

Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30 00:57:29 -04:00
Dennis Li
22503d803d drm/amdgpu: change to save bad pages in UMC error interrupt callback
Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce
the unnecessary calling of amdgpu_ras_save_bad_pages.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-30 00:57:17 -04:00
John Clements
4d2aae33d9 Revert drm/amdgpu: disable sienna chichlid UMC RAS
This reverts commit 265c280a48.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-10-21 16:15:31 -04:00
Alex Deucher
a069a9eb73 drm/amdgpu: fix a warning in amdgpu_ras.c (v2)
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_fs_init’:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1284:2: warning: ignoring return value of ‘sysfs_create_group’, declared with attribute warn_unused_result [-Wunused-result]
 1284 |  sysfs_create_group(&adev->dev->kobj, &group);
      |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

v2: just print an error for sysfs group creation failure

Acked-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25 17:03:22 -04:00
Guchun Chen
c3d4d45db2 drm/amdgpu: clean up ras sysfs creation (v2)
Merge ras sysfs creation together by calling sysfs_create_group
once, as sysfs_update_group may not work properly as expected.

v2: improve commit message

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25 17:03:22 -04:00
John Clements
265c280a48 drm/amdgpu: disable sienna chichlid UMC RAS
disable UMC RAS in lieu of stability issues on certain sku

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-25 16:55:26 -04:00
Stanley.Yang
3f975d0f71 drm/amdgpu: update athub interrupt harvesting handle
GCEA/MMHUB EA error should not result to DF freeze, this is
fixed in next generation, but for some reasons the GCEA/MMHUB
EA error will result to DF freeze in previous generation,
diver should avoid to indicate GCEA/MMHUB EA error as hw fatal
error in kernel message by read GCEA/MMHUB err status registers.

Changed from V1:
    make query_ras_error_status function more general
    make read mmhub er status register more friendly

Changed from V2:
    move ras error status query function into do_recovery workqueue

Changed from V3:
    remove useless code from V2, print GCEA error status
    instance number

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-22 17:37:38 -04:00
Stanley.Yang
5436ab94cd drm/amdkfd: fix set kfd node ras properties value
The ctx->features are new RAS implementation which
is only available for Vega20 and onwards, it is not
available for vega10, vega10 should follow legacy
ECC implementation.

Changed from V1:
    wrap function to initialize kfd node properties

Changed from V2:
    remove wrap function and SDMA SRAM ECC check

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-26 16:40:19 -04:00
Luben Tuikov
4a580877bd drm/amdgpu: Get DRM dev from adev by inline-f
Add a static inline adev_to_drm() to obtain
the DRM device pointer from an amdgpu_device pointer.

Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 13:06:06 -04:00
Dennis Li
d95e8e97e2 drm/amdgpu: refine create and release logic of hive info
Change to dynamically create and release hive info object,
which help driver support more hives in the future.

v2:
Change to save hive object pointer in adev, to avoid locking
xgmi_mutex every time when calling amdgpu_get_xgmi_hive.

v3:
1. Change type of hive object pointer in adev from void* to
amdgpu_hive_info*.
2. remove unnecessary variable initialization.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 12:24:14 -04:00
Dennis Li
53b3f8f40e drm/amdgpu: refine codes to avoid reentering GPU recovery
if other threads have holden the reset lock, recovery will
fail to try_lock. Therefore we introduce atomic hive->in_reset
and adev->in_gpu_reset, to avoid reentering GPU recovery.

v2:
drop "? true : false" in the definition of amdgpu_in_reset

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24 12:22:56 -04:00
Guchun Chen
ae2bf61ff3 drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS
It can avoid potential build warn/error when
CONFIG_DEBUG_FS is not set.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:22:40 -04:00
Guchun Chen
2e2f5dd514 drm/amdgpu: fix NULL pointer access issue when unloading driver
When unloading driver by "modprobe -r amdgpu", one NULL pointer
dereference bug occurs in ras debugfs releasing. The cause is the
duplicated debugfs_remove, as drm debugfs_root dir has been cleaned
up already by drm_minor_unregister.

BUG: kernel NULL pointer dereference, address: 00000000000000a0
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 11 PID: 1526 Comm: modprobe Tainted: G           OE     5.6.0-guchchen #1
Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
RIP: 0010:down_write+0x15/0x40
Code: eb de e8 7e 17 72 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 53 48 89 fb e8 92
d8 ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 43 08 5b c3
RSP: 0018:ffffb1590386fcd0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000000000a0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff85b2fcc2 RDI: 00000000000000a0
RBP: ffffb1590386fd30 R08: ffffffff85b2fcc2 R09: 000000000002b3c0
R10: ffff97a330618c40 R11: 00000000000005f6 R12: ffff97a3481beb40
R13: 00000000000000a0 R14: ffff97a3481beb40 R15: 0000000000000000
FS:  00007fb11a717540(0000) GS:ffff97a376cc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000a0 CR3: 00000004066d6006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 simple_recursive_removal+0x63/0x370
 ? debugfs_remove+0x60/0x60
 debugfs_remove+0x40/0x60
 amdgpu_ras_fini+0x82/0x230 [amdgpu]
 ? __kernfs_remove.part.17+0x101/0x1f0
 ? kernfs_name_hash+0x12/0x80
 amdgpu_device_fini+0x1c0/0x580 [amdgpu]
 amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu]
 amdgpu_pci_remove+0x36/0x60 [amdgpu]
 pci_device_remove+0x3b/0xb0
 device_release_driver_internal+0xe5/0x1c0
 driver_detach+0x46/0x90
 bus_remove_driver+0x58/0xd0
 pci_unregister_driver+0x29/0x90
 amdgpu_exit+0x11/0x25 [amdgpu]
 __x64_sys_delete_module+0x13d/0x210
 do_syscall_64+0x5f/0x250
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:22:40 -04:00
Christian König
f1403342eb drm/amdgpu: revert "fix system hang issue during GPU reset"
The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems like this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1aa2.

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:22:40 -04:00
Nirmoy Das
2f53072434 drm/amdgpu: pass NULL pointer instead of 0
Fixes: c030f2e416 ("drm/amdgpu: add amdgpu_ras.c to support ras (v2)")
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:22:40 -04:00
Guchun Chen
66459e1db2 drm/amdgpu: add debugfs node to toggle ras error cnt harvest
Before ras recovery is issued, user could operate this debugfs
node to enable/disable the harvest of all RAS IPs' ras error
count registers, which will help keep hardware's registers'
status instead of cleaning up them.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:12:47 -04:00
Guchun Chen
f75e94d868 drm/amdgpu: bypass querying ras error count registers
Once ras recovery is issued by ras sync flood interrupt or
ras controller interrupt, add this guard to bypass or execute
ras error count register harvest of all IPs.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14 16:12:22 -04:00
John Clements
0ad7a64d69 drm/amdgpu: enable RAS support for sienna cichlid
enabled GECC error injection and query support

Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:29:08 -04:00
Guchun Chen
a219ecbb83 drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold = 0
When amdgpu_bad_page_threshold = 0, bad page reservation stuffs
are skipped in either UMC ECC irq or page retirement calling of
sync flood isr.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:27:20 -04:00
Guchun Chen
f848159b57 drm/amdgpu: decouple sysfs creating of bad page node
Bad page information should not be exposed by sysfs when
bad page retirement is disabled, so decouple it from ras
sysfs group creating, and add one guard before creating.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:27:16 -04:00
Guchun Chen
eb0c3cd48f drm/amdgpu: add one definition for RAS's sysfs/debugfs name(v2)
Add one definition for the RAS module's FS name. It's used
in both debugfs and sysfs cases.

v2: Use static variable instead of macro definition.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:27:08 -04:00
Guchun Chen
bf0b91b78f drm/amdgpu: restore ras flags when user resets eeprom(v2)
RAS flags needs to be cleaned as well when user requires
one clean eeprom.

v2: RAS flags shall be restored after eeprom reset succeeds.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:27:04 -04:00
Guchun Chen
e8fbaf0342 drm/amdgpu: break GPU recovery once it's in bad state(v4)
When GPU executes recovery and retriving bad GPU tag
from external eerpom device, the recovery will be broken
and error message is printed as well for user's awareness.

v2: Refine warning message in threshold reaching case, and
    fix spelling typo.

v3: Fix explicit calling of bad gpu.

v4: Rename function names.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:54 -04:00
Guchun Chen
35cd2cdadb drm/amdgpu: skip bad page reservation once issuing from eeprom write
Once the ras recovery is issued from eeprom write itself,
bad page reservation should be ignored, otherwise, recursive
calling of writting to eeprom would happen.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:38 -04:00
Guchun Chen
b82e65a935 drm/amdgpu: break driver init process when it's bad GPU(v5)
When retrieving bad gpu tag from eeprom, GPU init should
fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

v4: Fix missing check of returning value of i2c
    initialization error case.

v5: Use dev_err to print PCI information in dmesg instead
    of DRM_ERROR.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:26:31 -04:00
Guchun Chen
c84d46707e drm/amdgpu: validate bad page threshold in ras(v3)
Bad page threshold value should be valid in the range between
-1 and max records length of eeprom. It could determine when
saved bad pages exceed threshold value, and proceed corresponding
actions.

v2: When using the default typical value, it should be min
value between typical value and eeprom max records length.

v3: drop the case of setting bad_page_cnt_threshold to be
    0xFFFFFFFF, as it confuses user.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-04 17:25:58 -04:00
Dennis Li
df9c8d1aa2 drm/amdgpu: fix system hang issue during GPU reset
when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
Suggested-by: Luben Tukov <luben.tuikov@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-27 16:21:37 -04:00
Guchun Chen
b16284259f drm/amdgpu: add printing after executing page reservation to eeprom
This will tell users if the faulty page has been written to
external eeprom device in dmesg log.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-22 18:42:23 -04:00
Wenhui Sheng
bb5c7235ea drm/amdgpu: RAS emergency restart logic refine
If we are in RAS triggered situation and
BACO isn't support, emergency restart is needed,
and this code is only needed for some specific
cases(vega20 with given smu fw version).

After we add smu mode1 reset for sienna cichlid, we
need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset,
so in amdgpu_device_gpu_recover, we need differentiate
which mode1 reset we are using, then decide if it's
a full reset and then decide if emergency restart is needed,
the logic will become much more complex.

After discussion with Hawking, move emergency restart logic
to an independent function.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-15 12:41:47 -04:00
Nirmoy Das
f3167919f6 drm/amdgpu: label internally used symbols as static
Used sparse(make C=1) to find these loose ends.

v2:
removed unwanted extra line

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-01 01:59:23 -04:00
Guchun Chen
9e69b1ee1d drm/amdgpu: remove useless code in RAS
Module parameter amdgpu_ras_mask has been involved in
the calculation of ras support capability, so drop this
redundant code.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-06-02 16:47:43 -04:00
Guchun Chen
5e91160ac0 drm/amdgpu: fix RAS memory leak in error case
RAS context memory needs to freed in failure case.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-06-02 16:47:20 -04:00
Guchun Chen
b0d4783a38 drm/amdgpu: print warning when input address is invalid
This will assist debug in error injection case.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-28 14:00:49 -04:00
John Clements
5c23e9e05e drm/amdgpu: Update RAS XGMI error inject sequence
Disable XGMI link power down prior to issuing a XGMI RAS error

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-14 17:42:35 -04:00
Arnd Bergmann
7fcffecf79 drm/amdgpu: allocate large structures dynamically
After the structure was padded to 1024 bytes, it is no longer
suitable for being a local variable, as the function surpasses
the warning limit for 32-bit architectures:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:587:5: error: stack frame size of 1072 bytes in function 'amdgpu_ras_feature_enable' [-Werror,-Wframe-larger-than=]
int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
    ^

Use kzalloc() instead to get it from the heap.

Fixes: a0d254820f ("drm/amdgpu: update RAS TA to Host interface")
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-05-05 13:12:55 -04:00
John Clements
a200034b66 drm/amdgpu: update RAS error handling
Parse return status from TA to determine error severity

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-30 16:48:20 -04:00
Dennis Li
a891d239f9 drm/amdgpu: set error query ready after all IPs late init
If set error query ready in amdgpu_ras_late_init, which will
cause some IP blocks aren't initialized, but their error query
is ready.

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:49 -04:00
Guchun Chen
12c17b9d62 drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU
When running ras uncorrectable error injection and triggering GPU
reset on sGPU, below issue is observed. It's caused by the list
uninitialized when accessing.

[   80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750
[   80.047300] #PF: supervisor write access in kernel mode
[   80.047351] #PF: error_code(0x0003) - permissions violation
[   80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061
[   80.047477] Oops: 0003 [#1] SMP PTI
[   80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G           OE     5.4.0-rc7-guchchen #1
[   80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
[   80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu]

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-22 18:11:46 -04:00
Guchun Chen
6952e99cfd drm/amdgpu: refine ras related message print
Prefix ras related kernel message logging with PCI
device info by replacing DRM_INFO/WARN/ERROR with
dev_info/warn/err. This can clearly tell user about
GPU device information where ras is. And add some
other ras message printing to make it more clear
and friendly as well.

Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-13 12:01:50 -04:00
John Clements
b3dbd6d3ec drm/amdgpu: resolve mGPU RAS query instability
upon receiving uncorrectable error, query every GPU node for ras errors

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-09 10:43:15 -04:00
Evan Quan
a9d82d2f91 drm/amdgpu: fix non-pointer dereference for non-RAS supported
Backtrace on gpu recover test on Navi10.

[ 1324.516681] RIP: 0010:amdgpu_ras_set_error_query_ready+0x15/0x20 [amdgpu]
[ 1324.523778] Code: 4c 89 f7 e8 cd a2 a0 d8 e9 99 fe ff ff 45 31 ff e9 91 fe ff ff 0f 1f 44 00 00 55 48 85 ff 48 89 e5 74 0e 48 8b 87 d8 2b 01 00 <40> 88 b0 38 01 00 00 5d c3 66 90 0f 1f 44 00 00 55 31 c0 48 85 ff
[ 1324.543452] RSP: 0018:ffffaa1040e4bd28 EFLAGS: 00010286
[ 1324.549025] RAX: 0000000000000000 RBX: ffff911198b20000 RCX: 0000000000000000
[ 1324.556217] RDX: 00000000000c0a01 RSI: 0000000000000000 RDI: ffff911198b20000
[ 1324.563514] RBP: ffffaa1040e4bd28 R08: 0000000000001000 R09: ffff91119d0028c0
[ 1324.570804] R10: ffffffff9a606b40 R11: 0000000000000000 R12: 0000000000000000
[ 1324.578413] R13: ffffaa1040e4bd70 R14: ffff911198b20000 R15: 0000000000000000
[ 1324.586464] FS:  00007f4441cbf540(0000) GS:ffff91119ed80000(0000) knlGS:0000000000000000
[ 1324.595434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1324.601345] CR2: 0000000000000138 CR3: 00000003fcdf8004 CR4: 00000000003606e0
[ 1324.608694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1324.616303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1324.623678] Call Trace:
[ 1324.626270]  amdgpu_device_gpu_recover+0x6e7/0xc50 [amdgpu]
[ 1324.632018]  ? seq_printf+0x4e/0x70
[ 1324.636652]  amdgpu_debugfs_gpu_recover+0x50/0x80 [amdgpu]
[ 1324.643371]  seq_read+0xda/0x420
[ 1324.647601]  full_proxy_read+0x5c/0x90
[ 1324.652426]  __vfs_read+0x1b/0x40
[ 1324.656734]  vfs_read+0x8e/0x130
[ 1324.660981]  ksys_read+0xa7/0xe0
[ 1324.665201]  __x64_sys_read+0x1a/0x20
[ 1324.669907]  do_syscall_64+0x57/0x1c0
[ 1324.674517]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1324.680654] RIP: 0033:0x7f44417cf081

Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:44 -04:00
John Clements
61380faa4b drm/amdgpu: disable ras query and iject during gpu reset
added flag to ras context to indicate if ras query functionality is ready

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-04-01 14:44:42 -04:00
John Clements
43c4d57618 drm/amdgpu: protect RAS sysfs during GPU reset
MMHub EDC becomes dirty after BACO reset

EDC registers should be cleared early on in reset phase

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-20 10:45:00 -04:00
Stanley.Yang
c1509f3f6f drm/amdgpu: fix warning in ras_debugfs_create_all()
Fix the warning
"warn: variable dereferenced before check 'obj' (see line 1131)"
by removing unnecessary checks as amdgpu_ras_debugfs_create_all()
is only called from amdgpu_debugfs_init() where obj member in
con->head list is not NULL.
Use list_for_each_entry() instead list_for_each_entry_safe() as obj
do not to be freeing or removing from list during this process.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-13 11:52:34 -04:00
Guchun Chen
88474ccad5 drm/amdgpu: update ras capability's query based on mem ecc configuration
RAS support capability needs to be updated on top of different
memeory ECC enablement, and remove redundant memory ecc check
in gmc module for vega20 and arcturus.

v2: check HBM ECC enablement and set ras mask accordingly.
v3: avoid to invoke atomfirmware interface to query twice.

Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-13 11:52:34 -04:00
Tao Zhou
204eaac625 drm/amdgpu: call ras_debugfs_create_all in debugfs_init
and remove each ras IP's own debugfs creation

this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-10 15:55:11 -04:00
Tao Zhou
f9317014ea drm/amdgpu: add function to creat all ras debugfs node
centralize all debugfs creation in one place for ras

this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-10 15:55:02 -04:00
Hawking Zhang
ec01fe2dbf drm/amdgpu: enable PCS error report on VG20
Now driver will report XGMI/WAFL PCS error through
sysfs xgmi_wafl_err_count node on Vega20

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-03-06 14:31:35 -05:00
Hawking Zhang
19744f5f2d drm/amdgpu: move get_xgmi_relative_phy_addr to amdgpu_xgmi.c
centralize all the xgmi related function to amdgpu_xgmi.c

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-26 14:17:32 -05:00
Guchun Chen
313c8fd33e drm/amdgpu: log on non-zero error conter per IP before GPU reset
Once sync flood interrupt is triggered by RAS error, before
actual GPU recovery job, it's necessary to log on and print
non-zero error counter, this will help user knows where the
RAS error source is from quickly.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-02-19 10:36:26 -05:00
John Clements
a6c44d2538 drm/amdgpu: added support to get mGPU DRAM base
resolves issue with RAS error injection in mGPU configuration

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-22 16:34:07 -05:00
Hawking Zhang
93af20f74e drm/amdgpu: check if driver should try recovery in ras recovery path
To allow the flexibilty for user to disable gpu recovery
in RAS recovery path by module parameter amdgpu_gpu_recovery

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-16 13:40:47 -05:00
Hawking Zhang
3e81ee9a78 drm/amdgpu: support error reporting for sdma ip block
invoke sdma query_ras_error_count to get sdma single
bit error count

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-01-14 10:18:08 -05:00
Guchun Chen
46cf2fecf5 drm/amdgpu: add missed return value set for error case
Return value should be set when going to error handle tag
for error case, this can avoid potential invalid array
access by upper caller.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-23 14:59:51 -05:00
zhengbin
374bf7bd6a drm/amdgpu: Remove unneeded semicolon in amdgpu_ras.c
Fixes coccicheck warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:318:2-3: Unneeded semicolon

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:11 -05:00
Guchun Chen
6193462409 drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu
BACO reset mode strategy is determined by latter func when
calling amdgpu_ras_reset_gpu. So not to confuse audience, drop
it.

Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-18 16:09:06 -05:00
Le Ma
f2a79be1c0 drm/amdgpu: export amdgpu_ras_find_obj to use externally
Change it to external interface.

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-05 16:25:20 -05:00
Hawking Zhang
baaeb610b1 drm/amdgpu: enable ras capablity check on arcturus
check hw ras capablity via atomfirmware

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-19 09:47:30 -05:00
Alex Deucher
ef177d11d6 drm/amdgpu: Improve RAS documentation (v2)
Clarify some areas, clean up formatting, add section for
unrecoverable error handling.

v2: fix grammatical errors

Reviewed-by: Yong Zhao <yong.zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-11-06 16:27:48 -05:00
Le Ma
bff77e86a3 drm/amdgpu: bypass some cleanup work after err_event_athub (v2)
PSP lost connection when err_event_athub occurs. These cleanup work can be
skipped in BACO reset.

v2: squash in missing include (Alex)

Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-10-30 11:06:52 -04:00