linux-yocto/drivers/ras/Kconfig
Yazen Ghannam 6f15e617cc RAS: Introduce a FRU memory poison manager
Memory errors are an expected occurrence on systems with high memory
density. Generally, errors within a small number of unique physical
locations are acceptable, based on manufacturer and/or admin policy.
During run time, memory with errors may be retired so it is no longer
used by the system. This is done in mm through page poisoning, and the
effect will remain until the system is restarted.

If a memory location is consistently faulty, then the same run time
error handling may occur in the next reboot cycle, leading to
terminating jobs due to that already known bad memory. This could be
prevented if information from the previous boot was not lost.

Some add-in cards with driver-managed memory have on-board persistent
storage. Their driver saves memory error information to the persistent
storage during run time. The information is then restored after reset,
and known bad memory will be retired before the hardware is used.
A running log of bad memory locations is kept across multiple resets.

A similar solution is desirable for CPUs. However, this solution should
leverage industry-standard components as much as possible, rather than
a bespoke platform driver.

Two components are needed: a record format and a persistent storage
interface.

Implement a new module to manage the record formats on persistent
storage. Use the requirements for an AMD MI300-based system to start.
Vendor- and platform-specific details can be abstracted later as needed.

  [ bp: Massage commit message and code, squash 30-ish more fixes from
    Yazen and me. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: <naveenkrishna.chatradhi@amd.com>
Signed-off-by: <naveenkrishna.chatradhi@amd.com>
Co-developed-by: <muralidhara.mk@amd.com>
Signed-off-by: <muralidhara.mk@amd.com>
Tested-by: <sathyapriya.k@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20240214033516.1344948-3-yazen.ghannam@amd.com
2024-02-20 18:56:15 +01:00

# SPDX-License-Identifier: GPL-2.0-only

menuconfig RAS
	bool "Reliability, Availability and Serviceability (RAS) features"
	help
	  Reliability, availability and serviceability (RAS) is a computer
	  hardware engineering term. Computers designed with higher levels
	  of RAS have a multitude of features that protect data integrity
	  and help them stay available for long periods of time without
	  failure.

	  Reliability can be defined as the probability that the system will
	  produce correct outputs up to some given time. Reliability is
	  enhanced by features that help to avoid, detect and repair hardware
	  faults.

	  Availability is the probability a system is operational at a given
	  time, i.e. the amount of time a device is actually operating as the
	  percentage of total time it should be operating.

	  Serviceability or maintainability is the simplicity and speed with
	  which a system can be repaired or maintained; if the time to repair
	  a failed system increases, then availability will decrease.

	  Note that Reliability and Availability are distinct concepts:
	  Reliability is a measure of the ability of a system to function
	  correctly, including avoiding data corruption, whereas Availability
	  measures how often it is available for use, even though it may not
	  be functioning correctly. For example, a server may run forever and
	  so have ideal availability, but may be unreliable, with frequent
	  data corruption.

if RAS

source "arch/x86/ras/Kconfig"
source "drivers/ras/amd/atl/Kconfig"

config RAS_FMPM
	tristate "FRU Memory Poison Manager"
	default m
	depends on AMD_ATL && ACPI_APEI
	help
	  Support saving and restoring memory error information across reboot
	  using ACPI ERST as persistent storage. Error information is saved with
	  the UEFI CPER "FRU Memory Poison" section format.

	  Memory will be retired during boot time and run time depending on
	  platform-specific policies.

endif
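
The save-and-restore idea described in the RAS_FMPM help text can be illustrated with a small user-space sketch. This is not the kernel module's code: a plain file stands in for ACPI ERST, the record layout is a made-up struct rather than the UEFI CPER "FRU Memory Poison" section, and every name here (`struct poison_log`, `poison_log_add`, `MAX_BAD_ADDRS`, and so on) is hypothetical. It only shows the general flow: append newly found bad addresses to a running log, persist the log, and on the next "boot" read it back and retire each known-bad address before use.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical policy limit on bad addresses tracked per FRU. */
#define MAX_BAD_ADDRS 8

/* Hypothetical record: a running log of bad physical addresses. */
struct poison_log {
	uint32_t count;
	uint64_t addrs[MAX_BAD_ADDRS];
};

/*
 * Record a newly poisoned address.
 * Returns 1 if added, 0 if already known, -1 if the log is full.
 */
int poison_log_add(struct poison_log *log, uint64_t addr)
{
	assert(log != NULL);

	for (uint32_t i = 0; i < log->count; i++)
		if (log->addrs[i] == addr)
			return 0;

	if (log->count >= MAX_BAD_ADDRS)
		return -1;

	log->addrs[log->count++] = addr;
	return 1;
}

/* Persist the log. A file stands in for the real persistent store. */
int poison_log_save(const struct poison_log *log, const char *path)
{
	FILE *f = fopen(path, "wb");
	size_t n;

	if (!f)
		return -1;
	n = fwrite(log, sizeof(*log), 1, f);
	fclose(f);
	return n == 1 ? 0 : -1;
}

/* Read the log back; the caller's log is left empty on any failure. */
int poison_log_restore(struct poison_log *log, const char *path)
{
	struct poison_log tmp;
	FILE *f = fopen(path, "rb");
	size_t n;

	log->count = 0;
	if (!f)
		return -1;
	n = fread(&tmp, sizeof(tmp), 1, f);
	fclose(f);
	if (n != 1 || tmp.count > MAX_BAD_ADDRS)
		return -1;

	*log = tmp;
	return 0;
}

/*
 * "Boot-time" pass: visit every known-bad address. A real system would
 * poison the page here; this sketch only prints it.
 * Returns the number of addresses visited.
 */
uint32_t poison_log_retire_all(const struct poison_log *log)
{
	for (uint32_t i = 0; i < log->count; i++)
		printf("retiring page at 0x%llx\n",
		       (unsigned long long)log->addrs[i]);
	return log->count;
}
```

A caller would invoke `poison_log_add()` followed by `poison_log_save()` whenever a run-time error is handled, and `poison_log_restore()` followed by `poison_log_retire_all()` early in the next boot. Keeping the duplicate check in `poison_log_add()` means a consistently faulty location takes up only one slot in the log across resets.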