devlink: Add Documentation/networking/devlink-health.txt
This patch adds a new file with information about the devlink health mechanism.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Aya Levin authored and David S. Miller committed Jan 18, 2019
1 parent ce019fa, commit b8c45a0
Showing 1 changed file with 86 additions and 0 deletions.
@@ -0,0 +1,86 @@
The health mechanism is targeted for Real Time Alerting, in order to know when
something bad has happened to a PCI device:
- Provide alert debug information
- Self healing
- If the problem needs vendor support, provide a way to gather all needed
  debugging information.

The main idea is to unify and centralize driver health reports in the
generic devlink instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The devlink health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, FW error, RX/TX
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by devlink.
A device driver can provide specific callbacks for each "health reporter", e.g.
 - Recovery procedures
 - Diagnostics and object dump procedures
 - OOB initial parameters
Different parts of the driver can register different types of health reporters
with different handlers.
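
The reporter model described above can be illustrated with a small user-space
sketch. This is not the kernel's devlink API: struct reporter_ops,
struct health_registry, reporter_create(), reporter_find(), tx_recover() and
tx_reporter_ops are hypothetical names chosen for illustration. It only shows
the idea of one reporter per error/health type, each with its own optional
callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model; NOT the kernel's devlink API. */
#define MAX_REPORTERS 8

/* Callbacks a driver may supply for one error/health type. */
struct reporter_ops {
    const char *name;            /* error/health type, e.g. "tx" */
    int (*recover)(void *priv);  /* recovery procedure (optional) */
    int (*diagnose)(void *priv); /* diagnostics (optional) */
    int (*dump)(void *priv);     /* object dump (optional) */
};

struct reporter {
    const struct reporter_ops *ops;
    void *priv;                  /* driver context passed to callbacks */
    int in_use;
};

struct health_registry {
    struct reporter slots[MAX_REPORTERS];
};

/* Look up a registered reporter by its type name. */
struct reporter *reporter_find(struct health_registry *reg, const char *name)
{
    assert(reg != NULL && name != NULL);
    for (size_t i = 0; i < MAX_REPORTERS; i++) {
        struct reporter *r = &reg->slots[i];
        if (r->in_use && strcmp(r->ops->name, name) == 0)
            return r;
    }
    return NULL;
}

/* Register one reporter per error/health type; duplicates are rejected. */
struct reporter *reporter_create(struct health_registry *reg,
                                 const struct reporter_ops *ops, void *priv)
{
    assert(reg != NULL && ops != NULL && ops->name != NULL);
    if (reporter_find(reg, ops->name) != NULL)
        return NULL;             /* this type is already registered */
    for (size_t i = 0; i < MAX_REPORTERS; i++) {
        struct reporter *r = &reg->slots[i];
        if (!r->in_use) {
            r->ops = ops;
            r->priv = priv;
            r->in_use = 1;
            return r;
        }
    }
    return NULL;                 /* registry is full */
}

/* Example handler: counts recoveries in the driver context. */
int tx_recover(void *priv)
{
    int *recoveries = priv;
    (*recoveries)++;
    return 0;
}

/* A "tx" reporter that only provides a recovery procedure. */
const struct reporter_ops tx_reporter_ops = {
    .name = "tx",
    .recover = tx_recover,
};
```

In this sketch a second part of the driver could register, say, an "rx"
reporter with a different ops structure, and a callback left as NULL simply
means that operation is not supported for that reporter.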

Once an error is reported, devlink health will perform the following actions:
  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    there is no other dump which is already stored)
  * An auto recovery attempt is made, depending on:
    - Auto-recovery configuration
    - Grace period vs. time passed since last recovery
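
The sequence of actions above can be sketched as a simplified user-space model.
The names struct reporter_state, handle_report() and recover_ok() are
hypothetical, and so is the exact grace period rule used here (an attempt is
allowed when no recovery has happened yet, or when at least grace_period_ms
have passed since the last one). The kernel implementation may apply
additional conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-reporter state; field names are illustrative only. */
struct reporter_state {
    bool auto_recover;           /* auto-recovery configuration */
    uint64_t grace_period_ms;    /* minimum gap between auto recoveries */
    uint64_t last_recovery_ms;   /* time of the last successful recovery */
    bool has_recovered;          /* false until the first recovery */
    bool healthy;                /* health status */
    uint64_t error_count;        /* statistics: reports seen */
    uint64_t recover_count;      /* statistics: successful recoveries */
    uint64_t trace_count;        /* stands in for trace buffer entries */
    bool dump_stored;            /* a single dump is kept per reporter */
    int (*recover)(void *priv);  /* driver recovery callback, may be NULL */
    void *priv;
};

/* Example recovery callback that always succeeds. */
int recover_ok(void *priv)
{
    (void)priv;
    return 0;
}

/*
 * Handle one error report at time now_ms.
 * Returns true if an auto recovery attempt was made.
 */
bool handle_report(struct reporter_state *r, uint64_t now_ms)
{
    assert(r != NULL);

    r->trace_count++;            /* 1. log to the trace buffer (modelled) */

    r->healthy = false;          /* 2. update status and statistics */
    r->error_count++;

    if (!r->dump_stored)         /* 3. keep a dump only if none is stored */
        r->dump_stored = true;

    /* 4. auto recovery depends on configuration and grace period */
    if (!r->auto_recover || r->recover == NULL)
        return false;
    if (r->has_recovered &&
        now_ms - r->last_recovery_ms < r->grace_period_ms)
        return false;            /* still inside the grace period */

    if (r->recover(r->priv) == 0) {
        r->healthy = true;
        r->recover_count++;
        r->last_recovery_ms = now_ms;
        r->has_recovered = true;
    }
    return true;
}
```

With a 500 ms grace period, a report at t=1000 triggers a recovery attempt, a
second report at t=1200 only updates the status and statistics, and a report
at t=1500 is eligible for recovery again.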

The user interface:
The user can access/change each reporter's parameters and driver specific
callbacks via devlink, e.g. per error type (per health reporter):
 - Configure the reporter's generic parameters (like: disable/enable auto
   recovery)
 - Invoke recovery procedure
 - Run diagnostics
 - Object dump

The devlink health interface (via netlink):
DEVLINK_CMD_HEALTH_REPORTER_GET
  Retrieves status and configuration info per DEV and reporter.
DEVLINK_CMD_HEALTH_REPORTER_SET
  Allows reporter-related configuration setting.
DEVLINK_CMD_HEALTH_REPORTER_RECOVER
  Triggers a reporter's recovery procedure.
DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
  Retrieves diagnostics data from a reporter on a device.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET
  Retrieves the last stored dump. Devlink health
  saves a single dump. If a dump is not already stored by devlink
  for this reporter, devlink generates a new dump.
  The dump output is defined by the reporter.
DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR
  Clears the last saved dump file for the specified reporter.
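
The single-dump semantics of DUMP_GET and DUMP_CLEAR can be illustrated with a
minimal user-space sketch. The names struct dump_slot, struct dump_reporter,
dump_get() and dump_clear() are hypothetical and are not part of the netlink
interface; the generation counter merely stands in for reporter-defined dump
contents.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the single dump kept for each reporter. */
struct dump_slot {
    bool stored;             /* true while a dump is saved */
    unsigned int generation; /* stands in for the dump contents */
};

struct dump_reporter {
    struct dump_slot slot;
    unsigned int dumps_taken; /* how many dumps were generated */
};

/* Stand-in for the reporter's dump callback: generate and save a dump. */
static void take_dump(struct dump_reporter *r)
{
    r->dumps_taken++;
    r->slot.generation = r->dumps_taken;
    r->slot.stored = true;
}

/* DUMP_GET: return the stored dump, generating one only if none exists. */
const struct dump_slot *dump_get(struct dump_reporter *r)
{
    assert(r != NULL);
    if (!r->slot.stored)
        take_dump(r);
    return &r->slot;
}

/* DUMP_CLEAR: drop the saved dump so that the next get generates a new one. */
void dump_clear(struct dump_reporter *r)
{
    assert(r != NULL);
    r->slot.stored = false;
}
```

In this model, repeated calls to dump_get() return the same dump until
dump_clear() is called, after which the next dump_get() generates a fresh one.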

                                               netlink
                                      +--------------------------+
                                      |                          |
                                      |            +             |
                                      |            |             |
                                      +--------------------------+
                                                   |request for ops
                                                   |(diagnose,
 mlx5_core                             devlink     |recover,
                                                   |dump)
+--------+                            +--------------------------+
|        |                            |    reporter|             |
|        |                            |  +---------v----------+  |
|        |   ops execution            |  |                    |  |
|     <----------------------------------+                    |  |
|        |                            |  |                    |  |
|        |                            |  + ^------------------+  |
|        |                            |    | request for ops     |
|        |                            |    | (recover, dump)     |
|        |                            |    |                     |
|        |                            |  +-+------------------+  |
|        |     health report          |  | health handler     |  |
|        +------------------------------->                    |  |
|        |                            |  +--------------------+  |
|        |     health reporter create |                          |
|        +---------------------------->                          |
+--------+                            +--------------------------+