Skip to content

Commit

Permalink
Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/lin…
Browse files Browse the repository at this point in the history
…ux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Add a FRU (Field Replaceable Unit) memory poison manager which
   collects and manages previously encountered hw errors in order to
   save them to persistent storage across reboots. Previously recorded
   errors are "replayed" upon reboot in order to poison memory which has
   caused said errors in the past.

   The main use case is stacked, on-chip memory which cannot simply be
   replaced so poisoning faulty areas of it and thus making them
   inaccessible is the only strategy to prolong its lifetime.

 - Add an AMD address translation library glue which converts the
   reported addresses of hw errors into system physical addresses in
   order to be used by other subsystems like memory failure, for
   example. Add support for MI300 accelerators to that library.

 - igen6: Add support for Alder Lake-N SoC

 - i10nm: Add Grand Ridge support

 - The usual fixlets and cleanups

* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/versal: Convert to platform remove callback returning void
  RAS/AMD/FMPM: Fix off by one when unwinding on error
  RAS/AMD/FMPM: Add debugfs interface to print record entries
  RAS/AMD/FMPM: Save SPA values
  RAS: Export helper to get ras_debugfs_dir
  RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
  RAS: Introduce a FRU memory poison manager
  RAS/AMD/ATL: Add MI300 row retirement support
  Documentation: Move RAS section to admin-guide
  EDAC/versal: Make the bit position of injected errors configurable
  EDAC/i10nm: Add Intel Grand Ridge micro-server support
  EDAC/igen6: Add one more Intel Alder Lake-N SoC support
  RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
  RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
  RAS/AMD/ATL: Add MI300 support
  Documentation: RAS: Add index and address translation section
  EDAC/amd64: Use new AMD Address Translation Library
  RAS: Introduce AMD Address Translation Library
  EDAC/synopsys: Convert to devm_platform_ioremap_resource()
  • Loading branch information
Linus Torvalds committed Mar 12, 2024
2 parents 1f75619 + af65545 commit b040240
Show file tree
Hide file tree
Showing 33 changed files with 5,165 additions and 336 deletions.
24 changes: 24 additions & 0 deletions Documentation/admin-guide/RAS/address-translation.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
.. SPDX-License-Identifier: GPL-2.0
Address translation
===================

x86 AMD
-------

Zen-based AMD systems include a Data Fabric that manages the layout of
physical memory. Devices attached to the Fabric, like memory controllers,
I/O, etc., may not have a complete view of the system physical memory map.
These devices may provide a "normalized", i.e. device physical, address
when reporting memory errors. Normalized addresses must be translated to
a system physical address for the kernel to action on the memory.

AMD Address Translation Library (CONFIG_AMD_ATL) provides translation for
this case.

Glossary of acronyms used in address translation for Zen-based systems

* CCM = Cache Coherent Moderator
* COD = Cluster-on-Die
* COH_ST = Coherent Station
* DF = Data Fabric
Original file line number Diff line number Diff line change
@@ -1,15 +1,10 @@
.. SPDX-License-Identifier: GPL-2.0
Reliability, Availability and Serviceability features
=====================================================

This documents different aspects of the RAS functionality present in the
kernel.

Error decoding
---------------
==============

* x86
x86
---

Error decoding on AMD systems should be done using the rasdaemon tool:
https://github.com/mchehab/rasdaemon/
Expand Down
7 changes: 7 additions & 0 deletions Documentation/admin-guide/RAS/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
.. toctree::
:maxdepth: 2

main
error-decoding
address-translation
Original file line number Diff line number Diff line change
@@ -1,8 +1,12 @@
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================
==================================================
Reliability, Availability and Serviceability (RAS)
==================================================

This documents different aspects of the RAS functionality present in the
kernel.

RAS concepts
************
Expand Down
2 changes: 1 addition & 1 deletion Documentation/admin-guide/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -122,7 +122,7 @@ configure specific aspects of kernel behavior to your liking.
pmf
pnp
rapidio
ras
RAS/index
rtc
serial-console
svga
Expand Down
1 change: 0 additions & 1 deletion Documentation/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -113,7 +113,6 @@ to ReStructured Text format, or are simply too old.
:maxdepth: 1

staging/index
RAS/ras


Translations
Expand Down
15 changes: 13 additions & 2 deletions MAINTAINERS
Original file line number Diff line number Diff line change
Expand Up @@ -897,6 +897,12 @@ Q: https://patchwork.kernel.org/project/linux-rdma/list/
F: drivers/infiniband/hw/efa/
F: include/uapi/rdma/efa-abi.h

AMD ADDRESS TRANSLATION LIBRARY (ATL)
M: Yazen Ghannam <Yazen.Ghannam@amd.com>
L: linux-edac@vger.kernel.org
S: Supported
F: drivers/ras/amd/atl/*

AMD AXI W1 DRIVER
M: Kris Chaplin <kris.chaplin@amd.com>
R: Thomas Delev <thomas.delev@amd.com>
Expand Down Expand Up @@ -7583,7 +7589,6 @@ R: Robert Richter <rric@kernel.org>
L: linux-edac@vger.kernel.org
S: Supported
T: git git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git edac-for-next
F: Documentation/admin-guide/ras.rst
F: Documentation/driver-api/edac.rst
F: drivers/edac/
F: include/linux/edac.h
Expand Down Expand Up @@ -18379,11 +18384,17 @@ M: Tony Luck <tony.luck@intel.com>
M: Borislav Petkov <bp@alien8.de>
L: linux-edac@vger.kernel.org
S: Maintained
F: Documentation/admin-guide/ras.rst
F: Documentation/admin-guide/RAS
F: drivers/ras/
F: include/linux/ras.h
F: include/ras/ras_event.h

RAS FRU MEMORY POISON MANAGER (FMPM)
M: Yazen Ghannam <Yazen.Ghannam@amd.com>
L: linux-edac@vger.kernel.org
S: Maintained
F: drivers/ras/amd/fmpm.c

RC-CORE / LIRC FRAMEWORK
M: Sean Young <sean@mess.org>
L: linux-media@vger.kernel.org
Expand Down
2 changes: 1 addition & 1 deletion arch/x86/include/asm/topology.h
Original file line number Diff line number Diff line change
Expand Up @@ -224,7 +224,7 @@ static inline bool topology_is_primary_thread(unsigned int cpu)
static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
static inline int topology_max_smt_threads(void) { return 1; }
static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
static inline unsigned int topology_amd_nodes_per_pkg(void) { return 0; };
static inline unsigned int topology_amd_nodes_per_pkg(void) { return 1; }
#endif /* !CONFIG_SMP */

static inline void arch_fix_phys_package_id(int num, u32 slot)
Expand Down
1 change: 1 addition & 0 deletions drivers/edac/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,7 @@ config EDAC_GHES
config EDAC_AMD64
tristate "AMD64 (Opteron, Athlon64)"
depends on AMD_NB && EDAC_DECODE_MCE
imply AMD_ATL
help
Support for error detection and correction of DRAM ECC errors on
the AMD64 families (>= K8) of memory controllers.
Expand Down
Loading

0 comments on commit b040240

Please sign in to comment.