linux/arch
Paul Mackerras da15c03b04 powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race
Testing has revealed the existence of a race condition where a XIVE
interrupt being shut down can be in one of the XIVE interrupt queues
(of which there are up to 8 per CPU, one for each priority) at the
point where free_irq() is called.  If this happens, can return an
interrupt number which has been shut down.  This can lead to various
symptoms:

- irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
  function gets called, resulting in the CPU's elevated interrupt
  priority (numerically lowered CPPR) never gets reset.  That then
  means that the CPU stops processing interrupts, causing device
  timeouts and other errors in various device drivers.

- The irq descriptor or related data structures can be in the process
  of being freed as the interrupt code is using them.  This typically
  leads to crashes due to bad pointer dereferences.

This race is basically what commit 62e0468650 ("genirq: Add optional
hardware synchronization for shutdown", 2019-06-28) is intended to
fix, given a get_irqchip_state() method for the interrupt controller
being used.  It works by polling the interrupt controller when an
interrupt is being freed until the controller says it is not pending.

With XIVE, the PQ bits of the interrupt source indicate the state of
the interrupt source, and in particular the P bit goes from 0 to 1 at
the point where the hardware writes an entry into the interrupt queue
that this interrupt is directed towards.  Normally, the code will then
process the interrupt and do an end-of-interrupt (EOI) operation which
will reset PQ to 00 (assuming another interrupt hasn't been generated
in the meantime).  However, there are situations where the code resets
P even though a queue entry exists (for example, by setting PQ to 01,
which disables the interrupt source), and also situations where the
code leaves P at 1 after removing the queue entry (for example, this
is done for escalation interrupts so they cannot fire again until
they are explicitly re-enabled).

The code already has a 'saved_p' flag for the interrupt source which
indicates that a queue entry exists, although it isn't maintained
consistently.  This patch adds a 'stale_p' flag to indicate that
P has been left at 1 after processing a queue entry, and adds code
to set and clear saved_p and stale_p as necessary to maintain a
consistent indication of whether a queue entry may or may not exist.

With this, we can implement xive_get_irqchip_state() by looking at
stale_p, saved_p and the ESB PQ bits for the interrupt.

There is some additional code to handle escalation interrupts
properly; because they are enabled and disabled in KVM assembly code,
which does not have access to the xive_irq_data struct for the
escalation interrupt.  Hence, stale_p may be incorrect when the
escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
with some careful attention to barriers in order to ensure the correct
result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().

Finally, this adds code to make noise on the console (pr_crit and
WARN_ON(1)) if we find an interrupt queue entry for an interrupt
which does not have a descriptor.  While this won't catch the race
reliably, if it does get triggered it will be an indication that
the race is occurring and needs to be debugged.

Fixes: 243e25112d ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry
2019-08-16 14:16:59 +10:00
..
alpha Merge branch 'akpm' (patches from Andrew) 2019-07-17 08:58:04 -07:00
arc Merge branch 'akpm' (patches from Andrew) 2019-07-17 08:58:04 -07:00
arm i.MX fixes for 5.3: 2019-07-23 10:13:24 -07:00
arm64 SPDX fixes for 5.3-rc2 2019-07-28 10:00:06 -07:00
c6x Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu 2019-07-10 21:42:03 -07:00
csky treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers 2019-07-25 11:05:10 +02:00
h8300 h8300 update for 5.3 2019-07-17 09:36:38 -07:00
hexagon hexagon: switch to generic version of pte allocation 2019-07-21 09:53:00 -07:00
ia64 Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
m68k arch: mark syscall number 435 reserved for clone3 2019-07-15 00:39:33 +02:00
microblaze clone3-v5.3 2019-07-11 10:09:44 -07:00
mips This round of clk driver and framework updates is heavy on the driver update 2019-07-17 10:07:48 -07:00
nds32 treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers 2019-07-25 11:05:10 +02:00
nios2 nios2 update for v5.3-rc1 2019-07-12 15:38:05 -07:00
openrisc dma-mapping updates for Linux 5.3 2019-07-12 15:13:55 -07:00
parisc parisc: Flush ITLB in flush_tlb_all_local() only on split TLB machines 2019-07-21 11:03:02 +02:00
powerpc powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race 2019-08-16 14:16:59 +10:00
riscv SPDX fixes for 5.3-rc2 2019-07-28 10:00:06 -07:00
s390 SPDX fixes for 5.3-rc2 2019-07-28 10:00:06 -07:00
sh treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers 2019-07-25 11:05:10 +02:00
sparc treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers 2019-07-25 11:05:10 +02:00
um This pull request contains the following changes for UML: 2019-07-14 17:17:34 -07:00
unicore32 Kconfig updates for v5.3 2019-07-12 16:06:27 -07:00
x86 SPDX fixes for 5.3-rc2 2019-07-28 10:00:06 -07:00
xtensa Merge branch 'akpm' (patches from Andrew) 2019-07-17 08:58:04 -07:00
.gitignore
Kconfig Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-07-20 10:33:44 -07:00