Documentation: PCI: convert pci-error-recovery.txt to reST

Convert plain text documentation to reStructuredText format and add it to
Sphinx TOC tree.  No essential content change.

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Changbin Du 2019-05-14 22:47:29 +08:00 committed by Bjorn Helgaas
parent b66357f32f
commit 8a01fa6434
3 changed files with 144 additions and 132 deletions

@ -13,3 +13,4 @@ Linux PCI Bus Subsystem
pci-iov-howto pci-iov-howto
msi-howto msi-howto
acpi-info acpi-info
pci-error-recovery

@ -1,12 +1,13 @@
.. SPDX-License-Identifier: GPL-2.0
PCI Error Recovery ==================
------------------ PCI Error Recovery
February 2, 2006 ==================
Current document maintainer:
Linas Vepstas <linasvepstas@gmail.com> :Authors: - Linas Vepstas <linasvepstas@gmail.com>
updated by Richard Lary <rlary@us.ibm.com> - Richard Lary <rlary@us.ibm.com>
and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009 - Mike Mason <mmlnx@us.ibm.com>
Many PCI bus controllers are able to detect a variety of hardware Many PCI bus controllers are able to detect a variety of hardware
@ -63,7 +64,8 @@ mechanisms for dealing with SCSI bus errors and SCSI bus resets.
Detailed Design Detailed Design
--------------- ===============
Design and implementation details below, based on a chain of Design and implementation details below, based on a chain of
public email discussions with Ben Herrenschmidt, circa 5 April 2005. public email discussions with Ben Herrenschmidt, circa 5 April 2005.
@ -73,30 +75,33 @@ pci_driver. A driver that fails to provide the structure is "non-aware",
and the actual recovery steps taken are platform dependent. The and the actual recovery steps taken are platform dependent. The
arch/powerpc implementation will simulate a PCI hotplug remove/add. arch/powerpc implementation will simulate a PCI hotplug remove/add.
This structure has the form: This structure has the form::
struct pci_error_handlers
{
int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
int (*mmio_enabled)(struct pci_dev *dev);
int (*slot_reset)(struct pci_dev *dev);
void (*resume)(struct pci_dev *dev);
};
The possible channel states are: struct pci_error_handlers
enum pci_channel_state { {
pci_channel_io_normal, /* I/O channel is in normal state */ int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
pci_channel_io_frozen, /* I/O to channel is blocked */ int (*mmio_enabled)(struct pci_dev *dev);
pci_channel_io_perm_failure, /* PCI card is dead */ int (*slot_reset)(struct pci_dev *dev);
}; void (*resume)(struct pci_dev *dev);
};
Possible return values are: The possible channel states are::
enum pci_ers_result {
PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */ enum pci_channel_state {
PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */ pci_channel_io_normal, /* I/O channel is in normal state */
PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */ pci_channel_io_frozen, /* I/O to channel is blocked */
PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */ pci_channel_io_perm_failure, /* PCI card is dead */
PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */ };
};
Possible return values are::
enum pci_ers_result {
PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
};
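Reviewer sketch (not part of the commit): the callback table and the two enums above can be exercised outside the kernel with stand-in types. The block below is a minimal userspace illustration, assuming a simplified `struct pci_dev` with two hypothetical fields and a hypothetical `demo_` driver; the mapping from channel state to result code is one possible driver policy, not something the document prescribes. It leaves `mmio_enabled` unset to reflect that a driver does not have to implement every callback.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel structure; both fields are hypothetical. */
struct pci_dev {
	const char *name;	/* label used only by this illustration */
	int io_stopped;		/* set once the driver has stopped issuing I/O */
};

/* Copied from the documentation text above. */
enum pci_channel_state {
	pci_channel_io_normal,		/* I/O channel is in normal state */
	pci_channel_io_frozen,		/* I/O to channel is blocked */
	pci_channel_io_perm_failure,	/* PCI card is dead */
};

enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

struct pci_error_handlers {
	int (*error_detected)(struct pci_dev *dev, enum pci_channel_state state);
	int (*mmio_enabled)(struct pci_dev *dev);
	int (*slot_reset)(struct pci_dev *dev);
	void (*resume)(struct pci_dev *dev);
};

/* Assumed policy: stop I/O, then pick a result code from the channel state. */
static int demo_error_detected(struct pci_dev *dev, enum pci_channel_state state)
{
	dev->io_stopped = 1;

	switch (state) {
	case pci_channel_io_perm_failure:
		return PCI_ERS_RESULT_DISCONNECT;
	case pci_channel_io_frozen:
		return PCI_ERS_RESULT_NEED_RESET;
	default:
		return PCI_ERS_RESULT_CAN_RECOVER;
	}
}

/* A real driver would re-initialize the hardware here; the sketch just reports success. */
static int demo_slot_reset(struct pci_dev *dev)
{
	(void)dev;
	return PCI_ERS_RESULT_RECOVERED;
}

/* Normal operation may restart, so I/O is allowed again. */
static void demo_resume(struct pci_dev *dev)
{
	dev->io_stopped = 0;
}

/* mmio_enabled is deliberately left NULL: not every callback is required. */
static const struct pci_error_handlers demo_err_handlers = {
	.error_detected	= demo_error_detected,
	.slot_reset	= demo_slot_reset,
	.resume		= demo_resume,
};
```

In a kernel driver the table would be referenced from the driver's `pci_driver` structure; here it only shows which members a driver might fill in and which it might leave empty.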
A driver does not have to implement all of these callbacks; however, A driver does not have to implement all of these callbacks; however,
if it implements any, it must implement error_detected(). If a callback if it implements any, it must implement error_detected(). If a callback
@ -134,16 +139,17 @@ shouldn't do any new IOs. Called in task context. This is sort of a
All drivers participating in this system must implement this call. All drivers participating in this system must implement this call.
The driver must return one of the following result codes: The driver must return one of the following result codes:
- PCI_ERS_RESULT_CAN_RECOVER:
Driver returns this if it thinks it might be able to recover - PCI_ERS_RESULT_CAN_RECOVER
the HW by just banging IOs or if it wants to be given Driver returns this if it thinks it might be able to recover
a chance to extract some diagnostic information (see the HW by just banging IOs or if it wants to be given
mmio_enable, below). a chance to extract some diagnostic information (see
- PCI_ERS_RESULT_NEED_RESET: mmio_enable, below).
Driver returns this if it can't recover without a - PCI_ERS_RESULT_NEED_RESET
slot reset. Driver returns this if it can't recover without a
- PCI_ERS_RESULT_DISCONNECT: slot reset.
Driver returns this if it doesn't want to recover at all. - PCI_ERS_RESULT_DISCONNECT
Driver returns this if it doesn't want to recover at all.
The next step taken will depend on the result codes returned by the The next step taken will depend on the result codes returned by the
drivers. drivers.
@ -159,25 +165,27 @@ then recovery proceeds to STEP 4 (Slot Reset).
If the platform is unable to recover the slot, the next step If the platform is unable to recover the slot, the next step
is STEP 6 (Permanent Failure). is STEP 6 (Permanent Failure).
>>> The current powerpc implementation assumes that a device driver will .. note::
>>> *not* schedule or semaphore in this routine; the current powerpc
>>> implementation uses one kernel thread to notify all devices;
>>> thus, if one device sleeps/schedules, all devices are affected.
>>> Doing better requires complex multi-threaded logic in the error
>>> recovery implementation (e.g. waiting for all notification threads
>>> to "join" before proceeding with recovery.) This seems excessively
>>> complex and not worth implementing.
>>> The current powerpc implementation doesn't much care if the device The current powerpc implementation assumes that a device driver will
>>> attempts I/O at this point, or not. I/O's will fail, returning *not* schedule or semaphore in this routine; the current powerpc
>>> a value of 0xff on read, and writes will be dropped. If more than implementation uses one kernel thread to notify all devices;
>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH thus, if one device sleeps/schedules, all devices are affected.
>>> assumes that the device driver has gone into an infinite loop Doing better requires complex multi-threaded logic in the error
>>> and prints an error to syslog. A reboot is then required to recovery implementation (e.g. waiting for all notification threads
>>> get the device working again. to "join" before proceeding with recovery.) This seems excessively
complex and not worth implementing.
The current powerpc implementation doesn't much care if the device
attempts I/O at this point, or not. I/O's will fail, returning
a value of 0xff on read, and writes will be dropped. If more than
EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
assumes that the device driver has gone into an infinite loop
and prints an error to syslog. A reboot is then required to
get the device working again.
STEP 2: MMIO Enabled STEP 2: MMIO Enabled
------------------- --------------------
The platform re-enables MMIO to the device (but typically not the The platform re-enables MMIO to the device (but typically not the
DMA), and then calls the mmio_enabled() callback on all affected DMA), and then calls the mmio_enabled() callback on all affected
device drivers. device drivers.
@ -192,34 +200,36 @@ link reset was performed by the HW. If the platform can't just re-enable IOs
without a slot reset or a link reset, it will not call this callback, and without a slot reset or a link reset, it will not call this callback, and
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
>>> The following is proposed; no platform implements this yet: .. note::
>>> Proposal: All I/O's should be done _synchronously_ from within
>>> this callback, errors triggered by them will be returned via The following is proposed; no platform implements this yet:
>>> the normal pci_check_whatever() API, no new error_detected() Proposal: All I/O's should be done _synchronously_ from within
>>> callback will be issued due to an error happening here. However, this callback, errors triggered by them will be returned via
>>> such an error might cause IOs to be re-blocked for the whole the normal pci_check_whatever() API, no new error_detected()
>>> segment, and thus invalidate the recovery that other devices callback will be issued due to an error happening here. However,
>>> on the same segment might have done, forcing the whole segment such an error might cause IOs to be re-blocked for the whole
>>> into one of the next states, that is, link reset or slot reset. segment, and thus invalidate the recovery that other devices
on the same segment might have done, forcing the whole segment
into one of the next states, that is, link reset or slot reset.
The driver should return one of the following result codes: The driver should return one of the following result codes:
- PCI_ERS_RESULT_RECOVERED - PCI_ERS_RESULT_RECOVERED
Driver returns this if it thinks the device is fully Driver returns this if it thinks the device is fully
functional and thinks it is ready to start functional and thinks it is ready to start
normal driver operations again. There is no normal driver operations again. There is no
guarantee that the driver will actually be guarantee that the driver will actually be
allowed to proceed, as another driver on the allowed to proceed, as another driver on the
same segment might have failed and thus triggered a same segment might have failed and thus triggered a
slot reset on platforms that support it. slot reset on platforms that support it.
- PCI_ERS_RESULT_NEED_RESET - PCI_ERS_RESULT_NEED_RESET
Driver returns this if it thinks the device is not Driver returns this if it thinks the device is not
recoverable in its current state and it needs a slot recoverable in its current state and it needs a slot
reset to proceed. reset to proceed.
- PCI_ERS_RESULT_DISCONNECT - PCI_ERS_RESULT_DISCONNECT
Same as above. Total failure, no recovery even after Same as above. Total failure, no recovery even after
reset driver dead. (To be defined more precisely) reset driver dead. (To be defined more precisely)
The next step taken depends on the results returned by the drivers. The next step taken depends on the results returned by the drivers.
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
@ -293,31 +303,33 @@ device will be considered "dead" in this case.
Drivers for multi-function cards will need to coordinate among Drivers for multi-function cards will need to coordinate among
themselves as to which driver instance will perform any "one-shot" themselves as to which driver instance will perform any "one-shot"
or global device initialization. For example, the Symbios sym53cxx2 or global device initialization. For example, the Symbios sym53cxx2
driver performs device init only from PCI function 0: driver performs device init only from PCI function 0::
+ if (PCI_FUNC(pdev->devfn) == 0) + if (PCI_FUNC(pdev->devfn) == 0)
+ sym_reset_scsi_bus(np, 0); + sym_reset_scsi_bus(np, 0);
Result codes: Result codes:
- PCI_ERS_RESULT_DISCONNECT - PCI_ERS_RESULT_DISCONNECT
Same as above. Same as above.
Drivers for PCI Express cards that require a fundamental reset must Drivers for PCI Express cards that require a fundamental reset must
set the needs_freset bit in the pci_dev structure in their probe function. set the needs_freset bit in the pci_dev structure in their probe function.
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
PCI card types: PCI card types::
+ /* Set EEH reset type to fundamental if required by hba */ + /* Set EEH reset type to fundamental if required by hba */
+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha)) + if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
+ pdev->needs_freset = 1; + pdev->needs_freset = 1;
+ +
Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
Failure). Failure).
>>> The current powerpc implementation does not try a power-cycle .. note::
>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
>>> However, it probably should. The current powerpc implementation does not try a power-cycle
reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
However, it probably should.
STEP 5: Resume Operations STEP 5: Resume Operations
@ -370,44 +382,43 @@ The current policy is to turn this into a platform policy.
That is, the recovery API only requires that: That is, the recovery API only requires that:
- There is no guarantee that interrupt delivery can proceed from any - There is no guarantee that interrupt delivery can proceed from any
device on the segment starting from the error detection and until the device on the segment starting from the error detection and until the
slot_reset callback is called, at which point interrupts are expected slot_reset callback is called, at which point interrupts are expected
to be fully operational. to be fully operational.
- There is no guarantee that interrupt delivery is stopped, that is, - There is no guarantee that interrupt delivery is stopped, that is,
a driver that gets an interrupt after detecting an error, or that detects a driver that gets an interrupt after detecting an error, or that detects
an error within the interrupt handler such that it prevents proper an error within the interrupt handler such that it prevents proper
ack'ing of the interrupt (and thus removal of the source) should just ack'ing of the interrupt (and thus removal of the source) should just
return IRQ_NOTHANDLED. It's up to the platform to deal with that return IRQ_NOTHANDLED. It's up to the platform to deal with that
condition, typically by masking the IRQ source during the duration of condition, typically by masking the IRQ source during the duration of
the error handling. It is expected that the platform "knows" which the error handling. It is expected that the platform "knows" which
interrupts are routed to error-management capable slots and can deal interrupts are routed to error-management capable slots and can deal
with temporarily disabling that IRQ number during error processing (this with temporarily disabling that IRQ number during error processing (this
isn't terribly complex). That means some IRQ latency for other devices isn't terribly complex). That means some IRQ latency for other devices
sharing the interrupt, but there is simply no other way. High end sharing the interrupt, but there is simply no other way. High end
platforms aren't supposed to share interrupts between many devices platforms aren't supposed to share interrupts between many devices
anyway :) anyway :)
>>> Implementation details for the powerpc platform are discussed in .. note::
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
>>> As of this writing, there is a growing list of device drivers with Implementation details for the powerpc platform are discussed in
>>> patches implementing error recovery. Not all of these patches are in the file Documentation/powerpc/eeh-pci-error-recovery.txt
>>> mainline yet. These may be used as "examples":
>>>
>>> drivers/scsi/ipr
>>> drivers/scsi/sym53c8xx_2
>>> drivers/scsi/qla2xxx
>>> drivers/scsi/lpfc
>>> drivers/next/bnx2.c
>>> drivers/next/e100.c
>>> drivers/net/e1000
>>> drivers/net/e1000e
>>> drivers/net/ixgb
>>> drivers/net/ixgbe
>>> drivers/net/cxgb3
>>> drivers/net/s2io.c
>>> drivers/net/qlge
The End As of this writing, there is a growing list of device drivers with
------- patches implementing error recovery. Not all of these patches are in
mainline yet. These may be used as "examples":
- drivers/scsi/ipr
- drivers/scsi/sym53c8xx_2
- drivers/scsi/qla2xxx
- drivers/scsi/lpfc
- drivers/next/bnx2.c
- drivers/next/e100.c
- drivers/net/e1000
- drivers/net/e1000e
- drivers/net/ixgb
- drivers/net/ixgbe
- drivers/net/cxgb3
- drivers/net/s2io.c
- drivers/net/qlge

@ -12143,7 +12143,7 @@ M: Sam Bobroff <sbobroff@linux.ibm.com>
M: Oliver O'Halloran <oohall@gmail.com> M: Oliver O'Halloran <oohall@gmail.com>
L: linuxppc-dev@lists.ozlabs.org L: linuxppc-dev@lists.ozlabs.org
S: Supported S: Supported
F: Documentation/PCI/pci-error-recovery.txt F: Documentation/PCI/pci-error-recovery.rst
F: drivers/pci/pcie/aer.c F: drivers/pci/pcie/aer.c
F: drivers/pci/pcie/dpc.c F: drivers/pci/pcie/dpc.c
F: drivers/pci/pcie/err.c F: drivers/pci/pcie/err.c
@ -12156,7 +12156,7 @@ PCI ERROR RECOVERY
M: Linas Vepstas <linasvepstas@gmail.com> M: Linas Vepstas <linasvepstas@gmail.com>
L: linux-pci@vger.kernel.org L: linux-pci@vger.kernel.org
S: Supported S: Supported
F: Documentation/PCI/pci-error-recovery.txt F: Documentation/PCI/pci-error-recovery.rst
PCI MSI DRIVER FOR ALTERA MSI IP PCI MSI DRIVER FOR ALTERA MSI IP
M: Ley Foon Tan <lftan@altera.com> M: Ley Foon Tan <lftan@altera.com>