linux/drivers/scsi
Suganath Prabu 320e77acb3 scsi: mpt3sas: Irq poll to avoid CPU hard lockups
Issue Description:
We have seen CPU lockups in the field on systems with a large number of
logical CPUs (more than 96).  The SAS3.0 controller (Invader series)
supports at most 96 MSI-x vectors and the SAS3.5 product (Ventura)
supports at most 128 MSI-x vectors.

This may be a generic issue for any PCI device that supports completion
on multiple reply queues.  It is explained here with respect to the
hardware supported by mpt3sas, to simplify the problem and the possible
changes to handle it.  The IT HBA (mpt3sas) supports multiple reply queues
in the completion path.  The driver creates MSI-x vectors for the
controller as "min of (FW supported reply queues, logical CPUs)".  If the
submitter is not interrupted via completion on the same CPU, a loop can
form in the IO path.  This can cause hard/soft CPU lockups, IO timeouts,
system sluggishness, etc.

Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
(e.g. CPU B) is busy processing the corresponding reply descriptors from
the reply descriptor queue upon receiving interrupts from the HBA.  If
CPU A is continuously pumping IOs, then CPU B (which is executing the ISR)
will always see valid reply descriptors in the reply descriptor queue and
will keep processing them in a loop without ever leaving the ISR handler.

The mpt3sas driver exits the ISR handler only when it finds an unused
reply descriptor in the reply descriptor queue.  Since CPU A is
continuously sending IOs, CPU B may always see a valid reply descriptor
(posted by the HBA firmware after processing the IO) in the reply
descriptor queue.  In the worst case, the driver never quits this loop in
the ISR handler.  Eventually, the watchdog detects a CPU lockup.
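
The loop described above can be illustrated with a small user-space
simulation.  This is only a sketch and not driver code: isr_drain and
submit_more are hypothetical names, and max_iterations exists solely so
that the simulation terminates (the loop being modelled has no such cap).

```python
from collections import deque


def isr_drain(reply_queue, submit_more, max_iterations=1_000_000):
    """Model an ISR that exits only when the reply queue is empty.

    reply_queue  -- deque of pending reply descriptors (CPU B's queue)
    submit_more  -- callable modelling CPU A: given the number of
                    descriptors processed so far, it returns how many new
                    replies were posted in the meantime
    Returns (processed, exited_voluntarily).
    """
    processed = 0
    while reply_queue:
        if processed >= max_iterations:
            # Safety cap of the simulation only; a real ISR would still be
            # spinning here until the watchdog fires.
            return processed, False
        reply_queue.popleft()
        processed += 1
        # CPU A keeps submitting while CPU B is still inside the ISR.
        reply_queue.extend(range(submit_more(processed)))
    return processed, True


# Idle submitter: the queue drains and the handler exits on its own.
idle = isr_drain(deque(range(5)), lambda n: 0)

# Busy submitter: one new reply per completion, so the queue never empties.
busy = isr_drain(deque(range(5)), lambda n: 1, max_iterations=1000)
```

With an idle submitter the handler leaves after five descriptors; with a
submitter that posts one reply per completion, only the artificial cap
ends the loop.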

The behavior described above is uncommon if "rq_affinity" is set to 2 or
if the affinity_hint is honored by irqbalance with the "exact" policy.  If
rq_affinity is set to 2, the submitter is always interrupted via
completion on the same CPU.  If irqbalance is using the "exact" policy,
the interrupt is delivered to the submitter CPU.

If the ratio of CPU count to MSI-x vector (reply descriptor queue) count
is not 1:1, we are still exposed to the issue explained above, and for
that we do not have any existing solution.

Exposure to soft/hard lockups if the CPU count is higher than the number
of MSI-x vectors supported by the device:

If the ratio of CPU count to MSI-x vector count is not 1:1 (in other
words, if it is X:1 with X > 1), then neither the 'exact' irqbalance
policy nor rq_affinity = 2 will help to avoid CPU hard/soft lockups.
There is no one-to-one mapping between CPUs and MSI-x vectors; instead,
one MSI-x interrupt (or reply descriptor queue) is shared by a group of
CPUs, so a loop can form in the IO path within that CPU group and lockups
may be observed.

For example: consider a system with two NUMA nodes, each with four
logical CPUs, and with two MSI-x vectors enabled on the HBA.  The ratio of
CPU count to MSI-x vector count is then 4:1, e.g. MSI-x vector 0 has
affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0 and MSI-x vector 1
has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node 1.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3                 --> MSI-x 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7                 --> MSI-x 1
node 1 size: 65536 MB
node 1 free: 63176 MB
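
The 4:1 ratio in this example follows from the "min of (FW supported
reply queues, logical CPUs)" rule.  A minimal sketch of that arithmetic is
given here; the function names are hypothetical and are not taken from
the driver.

```python
import math


def msix_vector_count(fw_reply_queues, logical_cpus):
    """Vectors created for the controller: min(FW reply queues, CPUs)."""
    return min(fw_reply_queues, logical_cpus)


def cpus_per_vector(fw_reply_queues, logical_cpus):
    """Worst-case number of CPUs sharing one vector (X in the X:1 ratio)."""
    vectors = msix_vector_count(fw_reply_queues, logical_cpus)
    return math.ceil(logical_cpus / vectors)


# The example system: 8 logical CPUs, 2 MSI-x vectors enabled on the HBA.
example_ratio = cpus_per_vector(fw_reply_queues=2, logical_cpus=8)

# A controller limited to 96 vectors on a machine with 128 logical CPUs.
large_system_ratio = cpus_per_vector(fw_reply_queues=96, logical_cpus=128)
```

For the example system this gives X = 4.  A 96-vector controller in a
128-CPU system gives X = 2 in the worst case, so some vectors are shared
by more than one CPU.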

Assume that the user starts an application which uses all the CPUs of NUMA
node 0 for issuing IOs.  Only one CPU from the affinity list, say CPU 0 (it
can be any CPU, since this depends on irqbalance), will receive the
interrupts from MSI-x vector 0 for all the IOs.  Over time, the share of
time CPU 0 spends on IO submission decreases and the share it spends in ISR
processing increases, as it gets busier handling the interrupts.
Gradually, IO submission on CPU 0 drops to zero and its ISR processing
reaches 100 percent, because an IO loop has formed within NUMA node 0:
CPU 1, CPU 2 & CPU 3 are continuously busy submitting heavy IO, and only
CPU 0 is busy in the ISR path, as it always finds a valid reply descriptor
in the reply descriptor queue.  Eventually, a hard lockup is observed.

The chance of hard/soft lockups grows with the value of X: the higher X
is, the more likely CPU lockups are.

Solution: Use the IRQ poll interface defined in "irq_poll.c".  The mpt3sas
driver will execute the ISR routine in softirq context and will always
quit the loop based on the budget provided to the IRQ poll interface.
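
The effect of a budget can be sketched with a user-space simulation.
This only models the idea of a bounded poll; it is not the kernel's
irq_poll API, and poll, run, and their parameters are hypothetical names.

```python
from collections import deque


def poll(reply_queue, budget):
    """Process at most `budget` reply descriptors, then return.

    Returns the number processed.  A value below `budget` means the queue
    was drained before the budget ran out.
    """
    done = 0
    while reply_queue and done < budget:
        reply_queue.popleft()
        done += 1
    return done


def run(initial, budget, posted_per_round, rounds):
    """Call poll() `rounds` times while a submitter keeps posting replies."""
    queue = deque(range(initial))
    history = []
    for _ in range(rounds):
        history.append(poll(queue, budget))
        # Between two polls the CPU is free to do other work; meanwhile
        # the submitting CPUs post more replies.
        queue.extend(range(posted_per_round))
    return history


# A submitter that refills the queue as fast as it is drained.
busy_history = run(initial=10, budget=4, posted_per_round=4, rounds=3)

# A submitter that has gone idle.
idle_history = run(initial=3, budget=4, posted_per_round=0, rounds=2)
```

Even when the submitter refills the queue as fast as it is drained, each
call returns after at most `budget` descriptors, so the processing loop
always has a bounded length.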

In these scenarios (i.e. where the ratio of CPU count to MSI-x vector
count is X:1, with X > 1), the IRQ poll interface avoids CPU hard lockups
through a voluntary exit from reply queue processing based on the budget.
Note - only one MSI-x vector is busy doing processing.

Irqstat output:

IRQs / 1 second(s)
IRQ#   TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
  44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
  45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1

This approach is used only if the CPU count is greater than the number of
MSI-x vectors supported by the FW.

Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2019-03-18 17:16:43 -04:00
aacraid SCSI misc on 20190315 2019-03-16 12:51:50 -07:00
aic7xxx scsi: remove unneeded header search paths 2019-01-29 01:22:21 -05:00
aic94xx scsi: aic94xx: fix calls to dma_set_mask_and_coherent() 2019-02-25 21:37:26 -05:00
arcmsr scsi: arcmsr: Update driver version to v1.40.00.10-20190116 2019-01-22 21:38:21 -05:00
arm scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
be2iscsi genirq/affinity: Add new callback for (re)calculating interrupt sets 2019-02-18 11:21:28 +01:00
bfa SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
bnx2fc SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
bnx2i SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
csiostor SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
cxgbi SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
cxlflash SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
device_handler scsi: return blk_status_t from device handler ->prep_fn 2018-11-09 19:17:14 -07:00
dpt
esas2r scsi: ata: Use unsigned int for cmd's type in ioctls in scsi_host_template 2019-02-08 17:33:00 -05:00
fcoe scsi: fcoe: make use of fip_mode enum complete 2019-02-19 18:58:38 -05:00
fnic scsi: fnic: Remove set but not used variable 'vdev' 2019-01-29 01:16:09 -05:00
hisi_sas SCSI misc on 20190315 2019-03-16 12:51:50 -07:00
ibmvscsi scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
ibmvscsi_tgt scsi: target/core: Remove the write_pending_status() callback function 2019-02-04 21:23:59 -05:00
isci scsi: isci: initialize shost fully before calling scsi_add_host() 2019-01-08 22:27:24 -05:00
libfc Revert "scsi: libfc: Add WARN_ON() when deleting rports" 2019-02-04 22:17:33 -05:00
libsas SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
lpfc SCSI misc on 20190315 2019-03-16 12:51:50 -07:00
megaraid SCSI misc on 20190315 2019-03-16 12:51:50 -07:00
mpt3sas scsi: mpt3sas: Irq poll to avoid CPU hard lockups 2019-03-18 17:16:43 -04:00
mvsas scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
pcmcia scsi: pcmcia: nsp_cs: Remove unnecessary parentheses 2019-01-29 01:28:49 -05:00
pm8001 SCSI fixes on 20190118 2019-01-20 09:15:04 +12:00
qedf SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
qedi SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
qla2xxx SCSI misc on 20190315 2019-03-16 12:51:50 -07:00
qla4xxx SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
smartpqi scsi: smartpqi: bump driver version 2019-03-18 16:48:29 -04:00
snic scsi: snic: no need to check return value of debugfs_create functions 2019-01-29 00:40:54 -05:00
sym53c8xx_2 scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
ufs SCSI misc on 20190315 2019-03-16 12:51:50 -07:00
.gitignore
3w-9xxx.c scsi: 3w-9xxx: fix calls to dma_set_mask_and_coherent() 2019-02-25 21:37:25 -05:00
3w-9xxx.h
3w-sas.c SCSI fixes on 20190302 2019-03-02 11:39:54 -08:00
3w-sas.h
3w-xxxx.c scsi: 3w-xxxx: fix indentation issue, add missing tab 2018-12-19 21:54:07 -05:00
3w-xxxx.h scsi: 3w-xxx: fully convert to the generic DMA API 2018-10-17 21:58:51 -04:00
53c700.c scsi: 53c700: pass correct "dev" to dma_alloc_attrs() 2019-01-29 01:33:00 -05:00
53c700.h scsi: 53c700: Fix spelling of 'NEGOTIATION' 2018-08-30 07:27:22 -04:00
53c700.scr
53c700_d.h_shipped
BusLogic.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
BusLogic.h
FlashPoint.c scsi: FlashPoint: Remove unnecessary parentheses 2018-09-25 20:45:53 -04:00
Kconfig SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
Makefile scsi: remove the SCSI OSD library 2019-02-05 21:28:52 -05:00
NCR5380.c scsi: NCR5380: Return false instead of NULL 2018-11-05 22:47:38 -05:00
NCR5380.h scsi: NCR5380: Have NCR5380_select() return a bool 2018-09-28 02:17:51 -04:00
a100u2w.c cross-tree: phase out dma_zalloc_coherent() 2019-01-08 07:58:37 -05:00
a100u2w.h
a2091.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
a2091.h
a3000.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
a3000.h
a4000t.c
advansys.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
aha152x.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
aha152x.h
aha1542.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
aha1542.h
aha1740.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
aha1740.h scsi: core: remove Scsi_Cmnd typedef 2018-06-19 22:02:25 -04:00
am53c974.c scsi: esp_scsi: move dma mapping into the core code 2018-10-15 23:00:38 -04:00
atari_scsi.c nvram: Replace nvram_* function exports with static functions 2019-01-22 10:21:43 +01:00
atp870u.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
atp870u.h
bvme6000_scsi.c
ch.c scsi: core: check for equality of result byte values 2018-06-26 12:27:06 -04:00
constants.c
dc395x.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
dc395x.h
dmx3191d.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
dpt_i2o.c scsi: dpt_i2o: remove serial number usage 2019-02-27 09:19:23 -05:00
dpti.h scsi: dpt_i2o: stop using scsi_unregister 2018-03-15 00:25:37 -04:00
esp_scsi.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
esp_scsi.h scsi: esp_scsi: De-duplicate PIO routines 2018-10-17 21:38:20 -04:00
g_NCR5380.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
gdth.c scsi: gdth: use generic DMA API 2019-01-08 21:58:35 -05:00
gdth.h scsi: gdth: remove ISA and EISA support 2019-01-08 21:58:35 -05:00
gdth_ioctl.h scsi: gdth: remove dead code under #ifdef GDTH_IOCTL_PROC 2019-01-08 21:58:35 -05:00
gdth_proc.c scsi: gdth: use generic DMA API 2019-01-08 21:58:35 -05:00
gdth_proc.h scsi: gdth: remove gdth_{alloc,free}_ioctl 2019-01-08 21:57:42 -05:00
gvp11.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
gvp11.h
hosts.c SCSI misc on 20181224 2018-12-28 14:48:06 -08:00
hpsa.c scsi: hpsa: bump driver version 2019-03-18 16:46:14 -04:00
hpsa.h scsi: hpsa: correct enclosure sas address 2018-07-10 22:25:03 -04:00
hpsa_cmd.h
hptiop.c scsi: hptiop: fix calls to dma_set_mask() 2019-02-25 21:44:40 -05:00
hptiop.h
imm.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
imm.h
initio.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
initio.h
ipr.c scsi: ata: Use unsigned int for cmd's type in ioctls in scsi_host_template 2019-02-08 17:33:00 -05:00
ipr.h scsi: ipr: System hung while dlpar adding primary ipr adapter back 2018-09-21 12:35:39 -04:00
ips.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
ips.h scsi: ips: properly handle 64-bit DMA 2018-11-06 21:31:28 -05:00
iscsi_boot_sysfs.c
iscsi_tcp.c scsi: remove bidirectional command support 2019-02-05 21:29:21 -05:00
iscsi_tcp.h
jazz_esp.c scsi: esp_scsi: move dma mapping into the core code 2018-10-15 23:00:38 -04:00
lasi700.c
libiscsi.c SCSI misc on 20190315 2019-03-16 12:51:50 -07:00
libiscsi_tcp.c scsi: libiscsi: fall back to sendmsg for slab pages 2019-03-06 19:26:45 -05:00
mac53c94.c scsi: mac53c94: remove DISABLE_CLUSTERING 2018-12-18 23:13:12 -05:00
mac53c94.h
mac_esp.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
mac_scsi.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
megaraid.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
megaraid.h scsi: core: remove Scsi_Cmnd typedef 2018-06-19 22:02:25 -04:00
mesh.c cross-tree: phase out dma_zalloc_coherent() 2019-01-08 07:58:37 -05:00
mesh.h
mvme16x_scsi.c
mvme147.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
mvme147.h
mvumi.c SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
mvumi.h
myrb.c SCSI misc on 20181224 2018-12-28 14:48:06 -08:00
myrb.h scsi: myrb: Add Mylex RAID controller (block interface) 2018-10-17 21:06:49 -04:00
myrs.c scsi: myrs: remove the dma_boundary_limit 2018-12-19 21:43:30 -05:00
myrs.h scsi: myrs: Add Mylex RAID controller (SCSI interface) 2018-10-17 21:07:54 -04:00
ncr53c8xx.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
ncr53c8xx.h
nsp32.c scsi: nsp32: Remove unnecessary self assignment in nsp32_set_sync_entry 2019-01-29 01:26:57 -05:00
nsp32.h
nsp32_debug.c scsi: core: remove Scsi_Cmnd typedef 2018-06-19 22:02:25 -04:00
nsp32_io.h
osst.c scsi: st: osst: Remove negative constant left-shifts 2019-02-27 09:10:16 -05:00
osst.h
osst_detect.h
osst_options.h
pmcraid.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
pmcraid.h scsi: pmcraid: Use sgl_alloc_order() and sgl_free_order() 2018-02-13 21:49:15 -05:00
ppa.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
ppa.h
ps3rom.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
qla1280.c scsi: qla1280: set 64bit coherent mask 2019-01-11 22:30:51 -05:00
qla1280.h
qlogicfas.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
qlogicfas408.c
qlogicfas408.h
qlogicpti.c scsi: qlogicpti: Use of_node_name_eq for node name comparisons 2019-02-13 22:07:03 -05:00
qlogicpti.h scsi: qlogicpti: Use of_node_name_eq for node name comparisons 2019-02-13 22:07:03 -05:00
raid_class.c scsi: raid_attrs: fix unused variable warning 2018-08-30 07:21:04 -04:00
script_asm.pl
scsi.c scsi: kill command serial number 2019-02-27 09:19:24 -05:00
scsi.h scsi: core: remove Scsi_Cmnd typedef 2018-06-19 22:02:25 -04:00
scsi_common.c
scsi_debug.c SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
scsi_debugfs.c scsi: devinfo: use const_ilog2 for array indices 2018-04-20 19:14:28 -04:00
scsi_debugfs.h
scsi_devinfo.c scsi: devinfo: BLIST_RETRY_ASC_C1 for Fujitsu ETERNUS 2018-04-20 19:14:36 -04:00
scsi_dh.c scsi: scsi_dh: replace too broad "TP9" string with the exact models 2018-04-18 19:34:08 -04:00
scsi_error.c scsi: remove bidirectional command support 2019-02-05 21:29:21 -05:00
scsi_ioctl.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
scsi_lib.c SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
scsi_lib_dma.c
scsi_logging.c
scsi_logging.h
scsi_netlink.c
scsi_pm.c scsi: core: Synchronize request queue PM status only on successful resume 2019-01-08 21:57:26 -05:00
scsi_priv.h scsi: kill off the legacy IO path 2018-11-07 13:42:32 -07:00
scsi_proc.c
scsi_sas_internal.h
scsi_scan.c scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c 2019-02-27 09:39:28 -05:00
scsi_sysctl.c
scsi_sysfs.c scsi: kill off the legacy IO path 2018-11-07 13:42:32 -07:00
scsi_trace.c
scsi_transport_api.h
scsi_transport_fc.c bsg: convert to use blk-mq 2018-11-07 13:42:32 -07:00
scsi_transport_iscsi.c SCSI misc on 20181224 2018-12-28 14:48:06 -08:00
scsi_transport_sas.c scsi: bsg-lib: handle bidi requests without block layer help 2019-02-05 21:27:40 -05:00
scsi_transport_spi.c scsi: core: check for equality of result byte values 2018-06-26 12:27:06 -04:00
scsi_transport_srp.c for-4.18/block-20180603 2018-06-04 07:58:06 -07:00
scsicam.c
sd.c SCSI misc on 20190306 2019-03-09 16:53:47 -08:00
sd.h scsi: sd: Fix typo in sd_first_printk() 2019-02-12 22:33:00 -05:00
sd_dif.c block: move dif_prepare/dif_complete functions to block layer 2018-07-30 08:27:02 -06:00
sd_zbc.c scsi: sd_zbc: Fix sd_zbc_report_zones() buffer allocation 2019-02-15 22:09:54 -05:00
sense_codes.h
ses.c treewide: kzalloc() -> kcalloc() 2018-06-12 16:19:22 -07:00
sg.c Remove 'type' argument from access_ok() function 2019-01-03 18:57:57 -08:00
sgiwd93.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
sim710.c
sni_53c710.c
sr.c scsi: stop setting up request->special 2019-02-05 21:29:49 -05:00
sr.h
sr_ioctl.c block: Switch struct packet_command to use struct scsi_sense_hdr 2018-08-02 15:22:13 -06:00
sr_vendor.c
st.c scsi: st: osst: Remove negative constant left-shifts 2019-02-27 09:10:16 -05:00
st.h
st_options.h
stex.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
storvsc_drv.c SCSI misc on 20181224 2018-12-28 14:48:06 -08:00
sun3_scsi.c scsi: remove the use_clustering flag 2018-12-18 23:19:21 -05:00
sun3_scsi_vme.c
sun3x_esp.c scsi: esp_scsi: move dma mapping into the core code 2018-10-15 23:00:38 -04:00
sun_esp.c scsi: sun_esp: Use of_node_name_eq for node name comparisons 2018-12-07 21:56:06 -05:00
virtio_scsi.c scsi: virtio_scsi: don't send sc payload with tmfs 2019-03-06 12:35:02 -05:00
vmw_pvscsi.c SCSI misc on 20181224 2018-12-28 14:48:06 -08:00
vmw_pvscsi.h
wd33c93.c
wd33c93.h
wd719x.c scsi: flip the default on use_clustering 2018-12-18 23:13:12 -05:00
wd719x.h scsi: wd719x: use per-command private data 2018-11-15 14:27:08 -05:00
xen-scsifront.c scsi: xen-scsifront: remove DISABLE_CLUSTERING 2018-12-18 23:13:12 -05:00
zalon.c
zorro7xx.c
zorro_esp.c scsi: esp_scsi: De-duplicate PIO routines 2018-10-17 21:38:20 -04:00