Commit
EDAC: Add a memory repair control feature
Add a generic EDAC memory repair control driver to manage memory repairs in the system, such as CXL Post Package Repair (PPR) and other soft and hard PPR features.

For example, a CXL device with DRAM components that support PPR features may implement PPR maintenance operations. DRAM components may support two types of PPR:

- hard PPR, for a permanent row repair, and
- soft PPR, for a temporary row repair.

Soft PPR is much faster than hard PPR, but the repair is lost with a power cycle.

When a CXL device detects an error in a memory component, it may report the need for a repair maintenance operation by using an event record where the "maintenance needed" flag is set. The event records contain the device physical address (DPA) and other optional attributes of the memory to repair. The kernel will report the corresponding CXL general media or DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair operation in response to the device request via the sysfs repair control.

A device with memory repair features registers with the EDAC device driver, which retrieves a memory repair descriptor from the EDAC memory repair driver and exposes the sysfs repair control attributes to userspace in /sys/bus/edac/devices/<dev-name>/mem_repairX/.

The common memory repair control interface abstracts the control of arbitrary memory repair functionality into a standardized set of functions. The sysfs memory repair attribute nodes are only available if the client driver has implemented the corresponding attribute callback function and provided operations to the EDAC device driver during registration.

[ bp: Massage, fixup edac_dev_register() retvals, merge write_overflow fix to mem_repair_create_desc() ]

Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-5-shiju.jose@huawei.com
Shiju Jose authored and Borislav Petkov (AMD) committed Feb 26, 2025
1 parent bcbd069 commit 699ea52
Showing 9 changed files with 668 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,149 @@ | ||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory | ||
pertains to the memory media repair features control, such as | ||
PPR (Post Package Repair), memory sparing etc, where <dev-name> | ||
directory corresponds to a device registered with the EDAC | ||
device driver for the memory repair features. | ||
|
||
Post Package Repair is a maintenance operation that requests the memory | ||
device to perform a repair operation on its media. It is a memory | ||
self-healing feature that fixes a failing memory location by | ||
replacing it with a spare row in a DRAM device. For example, a | ||
CXL memory device with DRAM components that support PPR features may | ||
implement PPR maintenance operations. DRAM components may support | ||
two types of PPR functions: hard PPR, for a permanent row repair, and | ||
soft PPR, for a temporary row repair. Soft PPR may be much faster | ||
than hard PPR, but the repair is lost with a power cycle. | ||
|
||
The sysfs attribute nodes for a repair feature are only | ||
present if the parent driver has implemented the corresponding | ||
attr callback function and provided the necessary operations | ||
to the EDAC device driver during registration. | ||
|
||
In some states of system configuration (e.g. before address | ||
decoders have been configured), memory devices (e.g. CXL) | ||
may not have an active mapping in the main host address | ||
physical address map. As such, the memory to repair must be | ||
identified by a device specific physical addressing scheme | ||
using a device physical address (DPA). The DPA and other control | ||
attributes to use will be presented in related error records. | ||
|
||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_type | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
(RO) Memory repair type, for example post package repair, | ||
memory sparing etc. Valid values are: | ||
|
||
- ppr - Post package repair. | ||
|
||
- All other values are reserved. | ||
|
||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
(RW) Get/Set the current persist repair mode for a | ||
repair function. Depending on the memory repair function, a | ||
persist repair mode supported in the device is either temporary, | ||
where the repair is lost with a power cycle, or permanent. | ||
Valid values are: | ||
|
||
- 0 - Soft memory repair (temporary repair). | ||
|
||
- 1 - Hard memory repair (permanent repair). | ||
|
||
- All other values are reserved. | ||
|
||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
(RO) True if memory media is accessible and data is retained | ||
during the memory repair operation. | ||
Otherwise, the data may not be retained and memory requests may | ||
not be correctly processed during a repair operation. In such a | ||
case, the repair operation cannot be executed at runtime and the | ||
memory must be taken offline. | ||
|
||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/hpa | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
(RW) Host Physical Address (HPA) of the memory to repair. | ||
The HPA to use will be provided in related error records. | ||
|
||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/dpa | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
(RW) Device Physical Address (DPA) of the memory to repair. | ||
The specific DPA to use will be provided in related error | ||
records. | ||
|
||
In some states of system configuration (e.g. before address | ||
decoders have been configured), memory devices (e.g. CXL) | ||
may not have an active mapping in the main host address | ||
physical address map. As such, the memory to repair must be | ||
identified by a device specific physical addressing scheme | ||
using a DPA. The device physical address (DPA) to use will be | ||
presented in related error records. | ||
|
||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
(RW) Read/Write Nibble mask of the memory to repair. | ||
Nibble mask identifies one or more nibbles in error on the | ||
memory bus that produced the error event. Nibble Mask bit 0 | ||
shall be set if nibble 0 on the memory bus produced the | ||
event, etc. For example, for CXL PPR and sparing, a nibble mask | ||
bit set to 1 indicates the request to perform the repair | ||
operation in the specific device. All nibble mask bits set | ||
to 1 indicates the request to perform the operation in all | ||
devices. E.g. for CXL memory repair, the specific value of | ||
nibble mask to use will be provided in related error records. | ||
For more details, see the nibble mask field in CXL spec ver 3.1, | ||
section 8.2.9.7.1.2 Table 8-103 soft PPR, section | ||
8.2.9.7.1.3 Table 8-104 hard PPR and section 8.2.9.7.1.4 | ||
Table 8-105 memory sparing. | ||
|
||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa | ||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa | ||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa | ||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
(RW) The supported range of memory addresses that are to be | ||
repaired. The memory device may give the supported range of | ||
attributes to use and it will depend on the memory device | ||
and the portion of memory to repair. | ||
The userspace may receive the specific value of attributes | ||
to use for a repair operation from the memory device via | ||
related error records and trace events, e.g. CXL DRAM | ||
and CXL general media error records in CXL memory devices. | ||
|
||
What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair | ||
Date: March 2025 | ||
KernelVersion: 6.15 | ||
Contact: linux-edac@vger.kernel.org | ||
Description: | ||
(WO) Issue the memory repair operation for the specified | ||
memory repair attributes. The operation may fail if resources | ||
are insufficient based on the requirements of the memory | ||
device and repair function. | ||
|
||
- 1 - Issue the repair operation. | ||
|
||
- All other values are reserved. |
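
The attribute descriptions above suggest a simple userspace sequence: check the supported range, program the repair attributes, then write 1 to `repair`. The following Python sketch illustrates that sequence under stated assumptions. The helper names (`read_attr`, `write_attr`, `request_repair`) are hypothetical, and the textual format of the attribute values (hexadecimal or decimal) is assumed rather than taken from the ABI text. Attributes that the parent driver does not expose are skipped, since the nodes are only present when the corresponding callbacks are implemented.

```python
from pathlib import Path


def read_attr(repair_dir, name):
    """Return the stripped contents of a sysfs attribute, or None if absent."""
    path = Path(repair_dir) / name
    return path.read_text().strip() if path.is_file() else None


def write_attr(repair_dir, name, value):
    """Write a value to a sysfs attribute; return False if it is not exposed."""
    path = Path(repair_dir) / name
    if not path.is_file():
        return False
    path.write_text(f"{value}\n")
    return True


def request_repair(repair_dir, dpa, persist_mode=0, nibble_mask=None):
    """Program the repair attributes, then trigger the repair (hypothetical helper)."""
    lo = read_attr(repair_dir, "min_dpa")
    hi = read_attr(repair_dir, "max_dpa")
    # int(text, 0) accepts both "0x..." and decimal text; the format is assumed.
    if lo is not None and hi is not None:
        if not int(lo, 0) <= dpa <= int(hi, 0):
            raise ValueError(f"DPA {dpa:#x} is outside the supported range")
    # persist_mode: 0 = soft (temporary) repair, 1 = hard (permanent) repair.
    write_attr(repair_dir, "persist_mode", persist_mode)
    if not write_attr(repair_dir, "dpa", hex(dpa)):
        raise FileNotFoundError("dpa attribute is not exposed")
    if nibble_mask is not None:
        write_attr(repair_dir, "nibble_mask", hex(nibble_mask))
    # Writing 1 issues the repair; all other values are reserved.
    if not write_attr(repair_dir, "repair", 1):
        raise FileNotFoundError("repair attribute is not exposed")
```

A caller would pass a directory such as /sys/bus/edac/devices/<dev-name>/mem_repairX, with the DPA and nibble mask taken from the related error record. A real tool would also consult `repair_safe_when_in_use` before repairing memory that is online.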
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,4 +8,5 @@ EDAC Subsystem | |
:maxdepth: 1 | ||
|
||
features | ||
memory_repair | ||
scrub |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later | ||
========================== | ||
EDAC Memory Repair Control | ||
========================== | ||
|
||
Copyright (c) 2024-2025 HiSilicon Limited. | ||
|
||
:Author: Shiju Jose <shiju.jose@huawei.com> | ||
:License: The GNU Free Documentation License, Version 1.2 without | ||
Invariant Sections, Front-Cover Texts nor Back-Cover Texts. | ||
(dual licensed under the GPL v2) | ||
:Original Reviewers: | ||
|
||
- Written for: 6.15 | ||
|
||
Introduction | ||
------------ | ||
|
||
Some memory devices support repair operations to address issues in their | ||
memory media. Post Package Repair (PPR) and memory sparing are examples of | ||
such features. | ||
|
||
Post Package Repair (PPR) | ||
~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
Post Package Repair is a maintenance operation which requests the memory | ||
device to perform a repair operation on its media. It is a memory self-healing | ||
feature that fixes a failing memory location by replacing it with a spare row | ||
in a DRAM device. | ||
|
||
For example, a CXL memory device with DRAM components that support PPR | ||
features implements maintenance operations. DRAM components support these | ||
types of PPR functions: | ||
|
||
- hard PPR, for a permanent row repair, and | ||
- soft PPR, for a temporary row repair. | ||
|
||
Soft PPR is much faster than hard PPR, but the repair is lost after a power | ||
cycle. | ||
|
||
The data may not be retained and memory requests may not be correctly | ||
processed during a repair operation. In such a case, the repair operation should | ||
not be executed at runtime. | ||
|
||
For example, for CXL memory devices, see CXL spec rev 3.1 [1]_ sections | ||
8.2.9.7.1.1 PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation | ||
and 8.2.9.7.1.3 hPPR Maintenance Operation for more details. | ||
|
||
Memory Sparing | ||
~~~~~~~~~~~~~~ | ||
|
||
Memory sparing is a repair function that replaces a portion of memory with | ||
a portion of functional memory at a particular granularity. Memory | ||
sparing has cacheline/row/bank/rank sparing granularities. For example, in | ||
rank memory-sparing mode, one memory rank serves as a spare for other ranks on | ||
the same channel in case they fail. | ||
|
||
The spare rank is held in reserve and not used as active memory until | ||
a failure is indicated, with reserved capacity subtracted from the total | ||
available memory in the system. | ||
|
||
After an error threshold is surpassed in a system protected by memory sparing, | ||
the content of a failing rank of DIMMs is copied to the spare rank. The | ||
failing rank is then taken offline and the spare rank placed online for use as | ||
active memory in place of the failed rank. | ||
|
||
For example, CXL memory devices can support various subclasses of sparing | ||
operation, which vary in terms of the scope of the sparing being performed. | ||
|
||
Cacheline sparing subclass refers to a sparing action that can replace a full | ||
cacheline. Row sparing is provided as an alternative to PPR sparing functions | ||
and its scope is that of a single DDR row. Bank sparing allows an entire bank | ||
to be replaced. Rank sparing is defined as an operation in which an entire DDR | ||
rank is replaced. | ||
|
||
See CXL spec 3.1 [1]_ section 8.2.9.7.1.4 Memory Sparing Maintenance | ||
Operations for more details. | ||
|
||
.. [1] https://computeexpresslink.org/cxl-specification/ | ||
Use cases of generic memory repair features control | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
1. The soft PPR, hard PPR and memory-sparing features share similar control | ||
attributes. Therefore, there is a need for a standardized, generic sysfs | ||
repair control that is exposed to userspace and used by administrators, | ||
scripts and tools. | ||
|
||
2. When a CXL device detects an error in a memory component, it informs the | ||
host of the need for a repair maintenance operation by using an event | ||
record where the "maintenance needed" flag is set. The event record | ||
specifies the device physical address (DPA) and attributes of the memory | ||
that requires repair. The kernel reports the corresponding CXL general | ||
media or DRAM trace event to userspace, and userspace tools (e.g. | ||
rasdaemon) initiate a repair maintenance operation in response to the | ||
device request using the sysfs repair control. | ||
|
||
3. Userspace tools, such as rasdaemon, request a repair operation on a memory | ||
region when the maintenance needed flag is set, when an uncorrected memory | ||
error or an excess of corrected memory errors above a threshold value is | ||
reported, or when the exceeded corrected errors threshold flag is set for | ||
that memory. | ||
|
||
4. Multiple PPR/sparing instances may be present per memory device. | ||
|
||
5. Drivers should enforce that live repair is safe. In systems where memory | ||
mapping functions can change between boots, one approach to this is to log | ||
memory errors seen on this boot against which to check live memory repair | ||
requests. | ||
|
||
The File System | ||
--------------- | ||
|
||
The control attributes of a registered memory repair instance can be | ||
accessed in /sys/bus/edac/devices/<dev-name>/mem_repairX/. | ||
|
||
sysfs | ||
----- | ||
|
||
Sysfs files are documented in | ||
`Documentation/ABI/testing/sysfs-edac-memory-repair`. |
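
Use cases 3 and 5 above describe two small pieces of decision logic: when a userspace tool should request a repair, and how a live repair request can be checked against errors logged during the current boot. The Python sketch below models both. It is illustrative only: `MemErrorEvent`, `needs_repair` and `BootErrorLog` are hypothetical names, the field layout is an assumption and not the CXL event record format, and the boot log models in userspace a check that the document assigns to drivers.

```python
from dataclasses import dataclass


@dataclass
class MemErrorEvent:
    """Simplified, hypothetical view of a memory error record."""
    dpa: int
    maintenance_needed: bool = False   # device set the "maintenance needed" flag
    uncorrected: bool = False          # an uncorrected memory error was reported
    threshold_exceeded: bool = False   # device set its corrected-errors threshold flag
    corrected_count: int = 0           # corrected errors counted by the tool


def needs_repair(event, corrected_threshold):
    """Return True if any of the conditions from use case 3 holds."""
    return (
        event.maintenance_needed
        or event.uncorrected
        or event.threshold_exceeded
        or event.corrected_count > corrected_threshold
    )


class BootErrorLog:
    """Addresses with errors seen on this boot (one approach from use case 5)."""

    def __init__(self):
        self._seen = set()

    def record(self, dpa):
        self._seen.add(dpa)

    def allows_live_repair(self, dpa):
        # Only addresses that reported an error on this boot are accepted,
        # because memory mappings may differ between boots.
        return dpa in self._seen
```

For instance, a tool could call `needs_repair(event, corrected_threshold=10)` for each incoming trace event and proceed to the sysfs repair control only when it returns True.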