Commit Graph

17866 Commits

Author SHA1 Message Date
Ram Pai 1b4037dead powerpc: helper function to read, write AMR, IAMR, UAMOR registers
Implements helper functions to read and write the key related
registers; AMR, IAMR, UAMOR.

AMR register tracks the read,write permission of a key
IAMR register tracks the execute permission of a key
UAMOR register enables and disables a key

Acked-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:58:58 +11:00
Ram Pai 4fb158f65a powerpc: track allocation status of all pkeys
Total 32 keys are available on power7 and above. However
pkey 0,1 are reserved. So effectively we  have  30 pkeys.

On 4K kernels, we do not  have  5  bits  in  the  PTE to
represent  all the keys; we only have 3bits. Two of those
keys are reserved; pkey 0 and pkey 1. So effectively  we
have 6 pkeys.

This patch keeps track of reserved keys, allocated  keys
and keys that are currently free.

Also it  adds  skeletal  functions  and macros, that the
architecture-independent code expects to be available.

Reviewed-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 22:58:35 +11:00
Ram Pai 92e3da3cf1 powerpc: initial pkey plumbing
Basic  plumbing  to   initialize  the   pkey  system.
Nothing is enabled yet. A later patch will enable it
once all the infrastructure is in place.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[mpe: Rework copyrights to use SPDX tags]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-20 21:45:03 +11:00
David S. Miller 8565d26bcb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The BPF verifier conflict was some minor contextual issue.

The TUN conflict was less trivial.  Cong Wang fixed a memory leak of
tfile->tx_array in 'net'.  This is an skb_array.  But meanwhile in
net-next tun changed tfile->tx_arry into tfile->tx_ring which is a
ptr_ring.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-19 22:59:33 -05:00
Daniel Borkmann fa9dd599b4 bpf: get rid of pure_initcall dependency to enable jits
Having a pure_initcall() callback just to permanently enable BPF
JITs under CONFIG_BPF_JIT_ALWAYS_ON is unnecessary and could leave
a small race window in future where JIT is still disabled on boot.
Since we know about the setting at compilation time anyway, just
initialize it properly there. Also consolidate all the individual
bpf_jit_enable variables into a single one and move them under one
location. Moreover, don't allow for setting unspecified garbage
values on them.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-19 18:37:00 -08:00
Dan Williams 569d0365f5 dax: require 'struct page' by default for filesystem dax
If a dax buffer from a device that does not map pages is passed to
read(2) or write(2) as a target for direct-I/O it triggers SIGBUS. If
gdb attempts to examine the contents of a dax buffer from a device that
does not map pages it triggers SIGBUS. If fork(2) is called on a process
with a dax mapping from a device that does not map pages it triggers
SIGBUS. 'struct page' is required otherwise several kernel code paths
break in surprising ways. Disable filesystem-dax on devices that do not
map pages.

In addition to needing pfn_to_page() to be valid we also require devmap
pages.  We need this to detect dax pages in the get_user_pages_fast()
path and so that we can stop managing the VM_MIXEDMAP flag. For DAX
drivers that have not supported get_user_pages() to date we allow them
to opt-in to supporting DAX with the CONFIG_FS_DAX_LIMITED configuration
option which requires ->direct_access() to return pfn_t_special() pfns.
This leaves DAX support in brd disabled and scheduled for removal.

Note that when the initial dax support was being merged a few years back
there was concern that struct page was unsuitable for use with next
generation persistent memory devices. The theoretical concern was that
struct page access, being such a hotly used data structure in the
kernel, would lead to media wear out. While that was a reasonable
conservative starting position it has not held true in practice. We have
long since committed to using devm_memremap_pages() to support higher
order kernel functionality that needs get_user_pages() and
pfn_to_page().


Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-19 16:50:53 -08:00
Dan Williams 785a3fab4a mm, dax: introduce pfn_t_special()
In support of removing the VM_MIXEDMAP indication from DAX VMAs,
introduce pfn_t_special() for drivers to indicate that _PAGE_SPECIAL
should be used for DAX ptes. This also helps identify drivers like
dccssblk that only want to use DAX in a read-only fashion without
get_user_pages() support.

Ideally we could delete axonram and dcssblk DAX support, but if we need
to keep it better make it explicit that axonram and dcssblk only support
a sub-set of DAX due to missing _PAGE_DEVMAP support.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-19 16:50:53 -08:00
Linus Torvalds 4917d5df38 powerpc fixes for 4.15 #8
More than we'd like after rc8, but nothing very alarming either, just tying up
 loose ends before the release:
 
 Since we changed powernv to use cpufreq_get() from show_cpuinfo(), we see
 warnings with PREEMPT enabled. But the preempt_disable() in show_cpuinfo()
 doesn't actually prevent CPU hotplug as it suggests, so remove it.
 
 Two updates to the recently merged RFI flush code. Wire up the generic sysfs
 file to report the status, and add a debugfs file to allow enabling/disabling it
 at runtime.
 
 Two updates to xmon, one to add the RFI flush related fields to the paca dump,
 and another to not use hashed pointers in the paca dump.
 
 And one minor fix to add a missing include of linux/types.h in asm/hvcall.h, not
 seen to break the build in upstream, but correct anyway.
 
 Thanks to:
   Benjamin Herrenschmidt, Michal Suchanek, Nicholas Piggin.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJaYcINExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYDQ
 /w//S5OeowmRncTPsiCNlZSLlGsynE7/3QiB3qS0+hGK9vzcdl9Zw5nBxPmA16g+
 53z2pgRpsJvagqR3JFHwYsoOKg157heMZKbrFyV/EWXPUoZQD8Iko88BkNQfV9kH
 dZM4gOuPnn7lLJQJrdc13SRwlOlc1oqOiPqFQtwH5CTlH/I4kVe3iR+oaReA71+U
 bKZQWwXjRqjexLgxfPhaMH9CzfgK6LpM/TEc2tG1YBbKxbTCbXeFYhrNFvLKILHg
 sC/8QDYzFbdgqTrcw6rXpwoxA65Eu8I3puOLpTl/oJ/OQZZBLN6sjjJ33+Aq1yNR
 cXOakJgC+WjNYmw0C2eRBttMG1o0t/g52C9e0DWN2cXU9nwZ5mdFtbQos1kOdaLv
 V5sOpX6HH0+K8jefvw9qdmx4+kma38nyBY1hlsSXkMG5FwXVOx5V/hLFdH4zJs4+
 /9q3Bnwklc7X7Y+Qvqz/5rwi2T4++cIQpngRE2B7uUwjqNQR/iQUrXcb1MBie8LB
 RO7JD8D0ut+Utj75/b6PmBh2tvIHjWvC8BY9mexGfYgU4e/kHbpUa0tN8/OZhYv2
 pm1LP7Qiq+Lt8ii9UjvRKRdXWuxP/8dZ5EI7l7DJkwDItHTzQCDnfCRYjLqvXuDQ
 CGSsyMpQzzReWhWHDn4WJp4f8wRqDyN24d2a95ozaY7HJ4M=
 =zgy0
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "More than we'd like after rc8, but nothing very alarming either, just
  tying up loose ends before the release:

  Since we changed powernv to use cpufreq_get() from show_cpuinfo(), we
  see warnings with PREEMPT enabled. But the preempt_disable() in
  show_cpuinfo() doesn't actually prevent CPU hotplug as it suggests, so
  remove it.

  Two updates to the recently merged RFI flush code. Wire up the generic
  sysfs file to report the status, and add a debugfs file to allow
  enabling/disabling it at runtime.

  Two updates to xmon, one to add the RFI flush related fields to the
  paca dump, and another to not use hashed pointers in the paca dump.

  And one minor fix to add a missing include of linux/types.h in
  asm/hvcall.h, not seen to break the build in upstream, but correct
  anyway.

  Thanks to: Benjamin Herrenschmidt, Michal Suchanek, Nicholas Piggin"

* tag 'powerpc-4.15-8' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/pseries: include linux/types.h in asm/hvcall.h
  powerpc/64s: Allow control of RFI flush via debugfs
  powerpc/64s: Wire up cpu_show_meltdown()
  powerpc: Don't preempt_disable() in show_cpuinfo()
  powerpc/xmon: Don't print hashed pointers in paca dump
  powerpc/xmon: Add RFI flush related fields to paca dump
2018-01-19 11:19:11 -08:00
Christophe Lombard b1db551324 cxl: Add support for ASB_Notify on POWER9
The POWER9 core supports a new feature: ASB_Notify which requires the
support of the Special Purpose Register: TIDR.

The ASB_Notify command, generated by the AFU, will attempt to
wake-up the host thread identified by the particular LPID:PID:TID.

This patch assign a unique TIDR (thread id) for the current thread which
will be used in the process element entry.

Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 23:19:37 +11:00
Anju T Sudhakar 074db39e00 powerpc/perf: Change the data type for the variable 'ncpu' in IMC code
Change the data type for the variable 'ncpu' in ppc_core_imc_cpu_offline(),
since cpumask_any_but() returns an 'int' value.

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 23:05:01 +11:00
Anju T Sudhakar 684d984038 powerpc/powernv: Add debugfs interface for imc-mode and imc-command
In memory Collection (IMC) counter pmu driver controls the ucode's
execution state. At the system boot, IMC perf driver pause the ucode.
Ucode state is changed to "running" only when any of the nest units
are monitored or profiled using perf tool.

Nest units support only limited set of hardware counters and ucode is
always programmed in the "production mode" ("accumulation") mode. This
mode is configured to provide key performance metric data for most of
the nest units.

But ucode also supports other modes which would be used for "debug" to
drill down specific nest units. That is, ucode when switched to
"powerbus" debug mode (for example), will dynamically reconfigure the
nest counters to target only "powerbus" related events in the hardware
counters. This allows the IMC nest unit to focus on powerbus related
transactions in the system in more detail. At this point, production
mode events may or may not be counted.

IMC nest counters has both in-band (ucode access) and out of band
access to it. Since not all nest counter configurations are supported
by ucode, out of band tools are used to characterize other nest
counter configurations.

Patch provides an interface via "debugfs" to enable the switching of
ucode modes in the system. To switch ucode mode, one has to first
pause the microcode (imc_cmd), and then write the target mode value to
the "imc_mode" file.

Proposed Approach:

In the proposed approach, the function (export_imc_mode_and_cmd) which
creates the debugfs interface for imc mode and command is implemented
in opal-imc.c. Thus we can use imc_get_mem_addr() to get the homer
base address for each chip.

The interface to expose imc mode and command is required only if we
have nest pmu units registered. Employing the existing data structures
to track whether we have any nest units registered will require to
extend data from perf side to opal-imc.c. Instead an integer is
introduced to hold that information by counting successful nest unit
registration. Debugfs interface is removed based on the integer count.

Example for the interface:

  $ ls /sys/kernel/debug/imc
  imc_cmd_0  imc_cmd_8  imc_mode_0  imc_mode_8

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 23:05:00 +11:00
Anju T Sudhakar 8b4e6deaff powerpc/perf: Pass struct imc_events as a parameter to imc_parse_event()
Remove the allocation of struct imc_events from imc_parse_event().
Instead pass imc_events as a parameter to imc_parse_event(), which is
a pointer to a slot in the array allocated in
update_events_in_group().

Reported-by: Dan Carpenter ("powerpc/perf: Fix a sizeof() typo so we allocate less memory")
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 23:04:44 +11:00
Anju T Sudhakar ed8e443fee powerpc/perf: IMC code cleanup with some code refactoring
Factor out memory freeing part for attribute elements from
imc_common_cpuhp_mem_free().

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:43:18 +11:00
Anju T Sudhakar e7673818d9 powerpc/perf: Remove thread_imc_pmu global variable from
Remove the global variable 'thread_imc_pmu', since it is not used in the code.

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Reviewed-by: madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:39:57 +11:00
Madhavan Srinivasan 6cd74d2b61 powerpc/64s: Implement local_t using irq soft masking
local_t is used for atomic modifications for per-CPU data, versus
re-entrant modifications via interrupts.

local_t read-modify-write atomic operations are currently implemented
with hardware atomics (larx/stcx), which are quite slow. This patch
implements them by masking all types of interrupts that may do local_t
operations ("standard" and perf interrupts).

Rusty's benchmark (https://lkml.org/lkml/2008/12/16/450) gives the
following timings for the local_t test, in nanoseconds per iteration:

             larx/stcx   irq+pmu disable
_inc                38                10
_add                38                10
_read                4                 4
_add_return         38                10

There are still some interrupt types (system reset, machine check, and
watchdog), which can not safely use local_t operations, because they
are not masked.

An alternative approach was proposed, using a CR bit to mark a critical
section, which is tested in the interrupt return path, and would then
branch to a fixup handler (similar to exception fixups), which re-starts
the operation. The problem with this was the complexity of the fixup
handler and the latency of the slow path.

https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-November/123024.html

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:04 +11:00
Madhavan Srinivasan 3b7e302086 powerpc: use generic atomic implementation for local_t
powerpc implements local_t with atomic operations. There is already
an asm-generic implementation which does this using atomic_t.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:04 +11:00
Madhavan Srinivasan c6424382de powerpc/64s: Add new set of irq_soft_mask_ functions for PMI masking
To support soft-masking of the performance monitor interrupt, a set of
new powerpc_local_irq_pmu_save() and powerpc_local_irq_restore()
functions are added. And powerpc_local_irq_save() implemented, by
adding a new irq_soft_mask manipulation function
irq_soft_mask_or_return().

Local_irq_pmu_* macros are provided to access these
powerpc_local_irq_pmu* functions which includes
trace_hardirqs_on|off() to match what we have in
include/linux/irqflags.h.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:03 +11:00
Madhavan Srinivasan 9aa88188ee powerpc: Add new kconfig CONFIG_PPC_IRQ_SOFT_MASK_DEBUG
New Kconfig is added "CONFIG_PPC_IRQ_SOFT_MASK_DEBUG" to add WARN_ON
to alert the invalid transitions. Also moved the code under the
CONFIG_TRACE_IRQFLAGS in arch_local_irq_restore() to new Kconfig.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Fix name of CONFIG option in change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:03 +11:00
Madhavan Srinivasan f442d00480 powerpc/64s: Add support to mask perf interrupts and replay them
Two new bit mask field "IRQ_DISABLE_MASK_PMU" is introduced to support
the masking of PMI and "IRQ_DISABLE_MASK_ALL" to aid interrupt masking
checking.

Couple of new irq #defs "PACA_IRQ_PMI" and "SOFTEN_VALUE_0xf0*" added
to use in the exception code to check for PMI interrupts.

In the masked_interrupt handler, for PMIs we reset the MSR[EE] and
return. In the __check_irq_replay(), replay the PMI interrupt by
calling performance_monitor_common handler.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:02 +11:00
Madhavan Srinivasan f14e953b19 powerpc/64s: Add support to take additional parameter in MASKABLE_* macro
To support addition of "bitmask" to MASKABLE_* macros, factor out the
EXCPETION_PROLOG_1 macro.

Make it explicit the interrupt masking supported by a gievn interrupt
handler. Patch correspondingly extends the MASKABLE_* macros with an
addition's parameter. "bitmask" parameter is passed to SOFTEN_TEST
macro to decide on masking the interrupt.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:02 +11:00
Madhavan Srinivasan d32eb1b550 powerpc/64s: Avoid using EXCEPTION_PROLOG_1 macro in MASKABLE_*
Currently we use both EXCEPTION_PROLOG_1 and __EXCEPTION_PROLOG_1 in
the MASKABLE_* macros. As a cleanup, this patch makes MASKABLE_* to
use only __EXCEPTION_PROLOG_1. There is not logic change.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:01 +11:00
Madhavan Srinivasan 4e26bc4a4e powerpc/64: Rename soft_enabled to irq_soft_mask
Rename the paca->soft_enabled to paca->irq_soft_mask as it is no
longer used as a flag for interrupt state, but a mask.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:01 +11:00
Madhavan Srinivasan 01417c6cc7 powerpc/64: Change soft_enabled from flag to bitmask
"paca->soft_enabled" is used as a flag to mask some of interrupts.
Currently supported flags values and their details:

soft_enabled    MSR[EE]

0               0       Disabled (PMI and HMI not masked)
1               1       Enabled

"paca->soft_enabled" is initialized to 1 to make the interripts as
enabled. arch_local_irq_disable() will toggle the value when
interrupts needs to disbled. At this point, the interrupts are not
actually disabled, instead, interrupt vector has code to check for the
flag and mask it when it occurs. By "mask it", it update interrupt
paca->irq_happened and return. arch_local_irq_restore() is called to
re-enable interrupts, which checks and replays interrupts if any
occured.

Now, as mentioned, current logic doesnot mask "performance monitoring
interrupts" and PMIs are implemented as NMI. But this patchset depends
on local_irq_* for a successful local_* update. Meaning, mask all
possible interrupts during local_* update and replay them after the
update.

So the idea here is to reserve the "paca->soft_enabled" logic. New
values and details:

soft_enabled    MSR[EE]

1               0       Disabled  (PMI and HMI not masked)
0               1       Enabled

Reason for the this change is to create foundation for a third mask
value "0x2" for "soft_enabled" to add support to mask PMIs. When
->soft_enabled is set to a value "3", PMI interrupts are mask and when
set to a value of "1", PMI are not mask. With this patch also extends
soft_enabled as interrupt disable mask.

Current flags are renamed from IRQ_[EN?DIS}ABLED to
IRQS_ENABLED and IRQS_DISABLED.

Patch also fixes the ptrace call to force the user to see the softe
value to be alway 1. Reason being, even though userspace has no
business knowing about softe, it is part of pt_regs. Like-wise in
signal context.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:00 +11:00
Madhavan Srinivasan acb396d7c2 powerpc/64: Cleanup hard_irq_disable() macro
Minor cleanup to use helper function for manipulating
paca->soft_enabled variable.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:37:00 +11:00
Madhavan Srinivasan a67c543aac powerpc/64: Implement and use soft_enabled_set_return API
Add a new wrapper function, soft_enabled_set_return(), added to do the
paca->soft_enabled updates requiring a set-return.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:59 +11:00
Madhavan Srinivasan e0b5687bed powerpc/64: Implement and use soft_enabled_return API
Add a new wrapper function, soft_enabled_return(), added to return
paca->soft_enabled value.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:59 +11:00
Madhavan Srinivasan 0b63acf4a0 powerpc/64: Move set_soft_enabled() and rename
Move set_soft_enabled() from powerpc/kernel/irq.c to asm/hw_irq.c, to
encourage updates to paca->soft_enabled done via these access
function. Add "memory" clobber to hint compiler since
paca->soft_enabled memory is the target here.

Renaming it as soft_enabled_set() will make namespaces works better as
prefix than a postfix when new soft_enabled manipulation functions are
introduced.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:58 +11:00
Madhavan Srinivasan b5c1bd62c0 powerpc/64: Fix arch_local_irq_disable() prototype
In powerpc/64, the arch_local_irq_disable() function returns unsigned
long, which is not consistent with other architectures.

Move that set-return asm implementation into arch_local_irq_save(),
and make arch_local_irq_disable() return void, simplifying the
assembly.

Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:58 +11:00
Nicholas Piggin 9f83e00f4c powerpc/64: Improve inline asm in arch_local_irq_disable
arch_local_irq_disable is implemented strangely, with a temporary
output register being set to the desired soft_enabled value via an
immediate input, which is then used to store to memory. This is not
required, the immediate can be specified directly as a register input.

For simple cases at least, assembly is unchanged except register
mapping.

Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:57 +11:00
Madhavan Srinivasan c2e480ba82 powerpc/64: Add #defines for paca->soft_enabled flags
Two #defines IRQS_ENABLED and IRQS_DISABLED are added to be used when
updating paca->soft_enabled. Replace the hardcoded values used when
updating paca->soft_enabled with IRQ_(EN|DIS)ABLED #define. No logic
change.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:56 +11:00
Madhavan Srinivasan a8a4b03ab9 powerpc: Hard wire PT_SOFTE value to 1 in ptrace & signals
We have always had softe in pt_regs, and accessible via PT_SOFTE, even
though it is not userspace state.

The value userspace sees should always be 1, because we should never
be in userspace with interrupts soft disabled.

In a subsequent patch we will be changing the semantics of the kernel
softe value, so hard wire the value to 1 to retain the existing
semantics. As far as we know nothing ever looks at it, but better safe
than sorry.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Split out of larger patch, write change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:54 +11:00
Michael Ellerman 7a074fc083 powerpc/64s: Fix ps3 build error due to tlbiel_all()
The recent changes to TLB handling broke the PS3 build:

  arch/powerpc/include/asm/book3s/64/tlbflush.h:30: undefined reference to `.hash__tlbiel_all'

Fix it by adding an fallback version of tlbiel_all() for non-native
builds. It should never be called, due to checks in callers so it
calls BUG(). We should probably clean it up further but this will
suffice for now.

Fixes: d4748276ae ("powerpc/64s: Improve local TLB flush for boot and MCE on POWER9")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-19 22:36:45 +11:00
Paul Mackerras 3214d01f13 KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds
This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
information about the underlying machine's level of vulnerability
to the recently announced vulnerabilities CVE-2017-5715,
CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
instructions to assist software to work around the vulnerabilities.

The ioctl returns two u64 words describing characteristics of the
CPU and required software behaviour respectively, plus two mask
words which indicate which bits have been filled in by the kernel,
for extensibility.  The bit definitions are the same as for the
new H_GET_CPU_CHARACTERISTICS hypercall.

There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
indicates whether the new ioctl is available.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 15:17:01 +11:00
Benjamin Herrenschmidt 9b9b13a6d1 KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded
This works on top of the single escalation support. When in single
escalation, with this change, we will keep the escalation interrupt
disabled unless the VCPU is in H_CEDE (idle). In any other case, we
know the VCPU will be rescheduled and thus there is no need to take
escalation interrupts in the host whenever a guest interrupt fires.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt 35c2405efc KVM: PPC: Book3S HV: Make xive_pushed a byte, not a word
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt 2662efd050 KVM: PPC: Book3S HV: Check DR not IR to chose real vs virt mode MMIOs
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt 2267ea7661 KVM: PPC: Book3S HV: Don't use existing "prodded" flag for XIVE escalations
The prodded flag is only cleared at the beginning of H_CEDE,
so every time we have an escalation, we will cause the *next*
H_CEDE to return immediately.

Instead use a dedicated "irq_pending" flag to indicate that
a guest interrupt is pending for the VCPU. We don't reuse the
existing exception bitmap so as to avoid expensive atomic ops.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt bf4159da47 KVM: PPC: Book3S HV: Enable use of the new XIVE "single escalation" feature
That feature, provided by Power9 DD2.0 and later, when supported
by newer OPAL versions, allows us to sacrifice a queue (priority 7)
in favor of merging all the escalation interrupts of the queues
of a single VP into a single interrupt.

This reduces the number of host interrupts used up by KVM guests
especially when those guests use multiple priorities.

It will also enable a future change to control the masking of the
escalation interrupts more precisely to avoid spurious ones.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Benjamin Herrenschmidt c424c10823 KVM: PPC: Book3S HV: Add more info about XIVE queues in debugfs
Add details about enabled queues and escalation interrupts.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:10:21 +11:00
Paul Mackerras d27998185d Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next
This merges in the ppc-kvm topic branch of the powerpc tree to get
two patches which are prerequisites for the following patch series,
plus another patch which touches both powerpc and KVM code.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-19 12:09:57 +11:00
Nicholas Piggin c16bee4bde powerpc: define __ARCH_IRQ_EXIT_IRQS_DISABLED
powerpc calls irq_exit() with local irqs disabled, therefore it
can define __ARCH_IRQ_EXIT_IRQS_DISABLED.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 15:43:43 +11:00
Nicholas Piggin 47712a921b powerpc/watchdog: remove arch_trigger_cpumask_backtrace
The powerpc NMI IPIs may not be recoverable if they are taken in
some sections of code, and also there have been and still are issues
with taking NMIs (in KVM guest code, in firmware, etc) which makes them
a bit dangerous to use.

Generic code like softlockup detector and rcu stall detectors really
hammer on trigger_*_backtrace, which has lead to further problems
because we've implemented it with the NMI.

So stop providing NMI backtraces for now. Importantly, the powerpc code
uses NMI IPIs in crash/debug, and the SMP hardlockup watchdog. So if the
softlockup and rcu hang detection traces are not being printed because
the CPU is stuck with interrupts off, then the hard lockup watchdog
should get it with the NMI IPI.

Fixes: 2104180a53 ("powerpc/64s: implement arch-specific hardlockup watchdog")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 15:43:43 +11:00
Nicholas Piggin 1af19331a3 powerpc/64s: Relax PACA address limitations
Book3S PACA memory allocation is restricted by the RMA limit and also
must not take SLB faults when accessed in virtual mode. Currently a
fixed 256MB limit is used for this, which is imprecise and sub-optimal.

Update the paca allocation limits to use use the ppc64_rma_size for RMA
limit, and share the safe_stack_limit() that is currently used for stack
allocations that must not take virtual mode faults.

The safe_stack_limit() name is changed to ppc64_bolted_size() to match
ppc64_rma_size and some comments are updated. We also need to use
early_mmu_has_feature() because we are now calling this function prior
to the jump label patching that enables mmu_has_feature().

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Change mmu_has_feature() to early_mmu_has_feature()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 15:42:48 +11:00
Paul Mackerras d075745d89 KVM: PPC: Book3S HV: Improve handling of debug-trigger HMIs on POWER9
Hypervisor maintenance interrupts (HMIs) are generated by various
causes, signalled by bits in the hypervisor maintenance exception
register (HMER).  In most cases calling OPAL to handle the interrupt
is the correct thing to do, but the "debug trigger" HMIs signalled by
PPC bit 17 (bit 46) of HMER are used to invoke software workarounds
for hardware bugs, and OPAL does not have any code to handle this
cause.  The debug trigger HMI is used in POWER9 DD2.0 and DD2.1 chips
to work around a hardware bug in executing vector load instructions to
cache inhibited memory.  In POWER9 DD2.2 chips, it is generated when
conditions are detected relating to threads being in TM (transactional
memory) suspended mode when the core SMT configuration needs to be
reconfigured.

The kernel currently has code to detect the vector CI load condition,
but only when the HMI occurs in the host, not when it occurs in a
guest.  If a HMI occurs in the guest, it is always passed to OPAL, and
then we always re-sync the timebase, because the HMI cause might have
been a timebase error, for which OPAL would re-sync the timebase, thus
removing the timebase offset which KVM applied for the guest.  Since
we don't know what OPAL did, we don't know whether to subtract the
timebase offset from the timebase, so instead we re-sync the timebase.

This adds code to determine explicitly what the cause of a debug
trigger HMI will be.  This is based on a new device-tree property
under the CPU nodes called ibm,hmi-special-triggers, if it is
present, or otherwise based on the PVR (processor version register).
The handling of debug trigger HMIs is pulled out into a separate
function which can be called from the KVM guest exit code.  If this
function handles and clears the HMI, and no other HMI causes remain,
then we skip calling OPAL and we proceed to subtract the guest
timebase offset from the timebase.

The overall handling for HMIs that occur in the host (i.e. not in a
KVM guest) is largely unchanged, except that we now don't set the flag
for the vector CI load workaround on DD2.2 processors.

This also removes a BUG_ON in the KVM code.  BUG_ON is generally not
useful in KVM guest entry/exit code since it is difficult to handle
the resulting trap gracefully.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 15:31:25 +11:00
Paul Mackerras 00608e1f00 KVM: PPC: Book3S HV: Allow HPT and radix on the same core for POWER9 v2.2
POWER9 chip versions starting with "Nimbus" v2.2 can support running
with some threads of a core in HPT mode and others in radix mode.
This means that we don't have to prohibit independent-threads mode
when running a HPT guest on a radix host, and we don't have to do any
of the synchronization between threads that was introduced in commit
c01015091a ("KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix
hosts", 2017-10-19).

Rather than using up another CPU feature bit, we just do an
explicit test on the PVR (processor version register) at module
startup time to determine whether we have to take steps to avoid
having some threads in HPT mode and some in radix mode (so-called
"mixed mode").  We test for "Nimbus" (indicated by 0 or 1 in the top
nibble of the lower 16 bits) v2.2 or later, or "Cumulus" (indicated by
2 or 3 in that nibble) v1.1 or later.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-18 12:05:19 +11:00
Rob Herring 59f47eff03 powerpc/pci: Use of_irq_parse_and_map_pci() helper
Instead of calling both of_irq_parse_pci() and irq_create_of_mapping(),
call of_irq_parse_and_map_pci(), which does the same thing. This will allow
making of_irq_parse_pci() a private, static function.

This changes the logic slightly in that the fallback path will also be
taken if irq_create_of_mapping() fails internally.

Signed-off-by: Rob Herring <robh@kernel.org>
[bhelgaas: fold in virq init from Stephen Rothwell <sfr@canb.auug.org.au>]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17 17:54:37 -06:00
Nicholas Piggin c610d65c0a powerpc/pseries: lift RTAS limit for hash
With the previous patch to switch to 64-bit mode after returning from
RTAS and before doing any memory accesses, the RMA limit need not be
clamped to 1GB to avoid RTAS bugs.

Keep the 1GB limit for older firmware (although this is more of a kernel
concern than RTAS), and remove it starting with POWER9.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:45:34 +11:00
Nicholas Piggin 5eae82cab5 powerpc/pseries: lift RTAS limit for radix
With the previous patch to switch to 64-bit mode after returning from
RTAS and before doing any memory accesses, the RMA limit need not be
clamped to 1GB to avoid RTAS bugs.

Keep the 1GB limit for older firmware (although this is more of a kernel
concern than RTAS), and remove it starting with POWER9.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:44:42 +11:00
Nicholas Piggin 47fee31dbd powerpc/64: rtas avoid accessing paca in 32-bit mode
Commit 177ba7c647 ("powerpc/mm/radix: Limit paca allocation in radix")
limited the paca allocation address to 1G on pSeries because RTAS return
accesses the paca in 32-bit mode:

    On return from RTAS we access the paca variables and we have 64 bit
    disabled. This requires us to limit paca in 32 bit range.

    Fix this by setting ppc64_rma_size to first_memblock_size/1G range.

Avoid this limit by switching to 64-bit mode before accessing any memory.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:42:58 +11:00
Nicholas Piggin 98ae0069cb powerpc/pseries: radix is not subject to RMA limit, remove it
The radix guest is not subject to the paravirtualized HPT VRMA limit,
so remove that from ppc64_rma_size calculation for that platform.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:42:14 +11:00
Nicholas Piggin 1513c33d71 powerpc/powernv: Remove real mode access limit for early allocations
This removes the RMA limit on powernv platform, which constrains
early allocations such as PACAs and stacks. There are still other
restrictions that must be followed, such as bolted SLB limits, but
real mode addressing has no constraints.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:41:44 +11:00
Nicholas Piggin d4748276ae powerpc/64s: Improve local TLB flush for boot and MCE on POWER9
There are several cases outside the normal address space management
where a CPU's entire local TLB is to be flushed:

  1. Booting the kernel, in case something has left stale entries in
     the TLB (e.g., kexec).

  2. Machine check, to clean corrupted TLB entries.

One other place where the TLB is flushed, is waking from deep idle
states. The flush is a side-effect of calling ->cpu_restore with the
intention of re-setting various SPRs. The flush itself is unnecessary
because in the first case, the TLB should not acquire new corrupted
TLB entries as part of sleep/wake (though they may be lost).

This type of TLB flush is coded inflexibly, several times for each CPU
type, and they have a number of problems with ISA v3.0B:

- The current radix mode of the MMU is not taken into account, it is
  always done as a hash flushn For IS=2 (LPID-matching flush from host)
  and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if
  the R field does not match the current radix mode.

- ISA v3.0B hash must flush the partition and process table caches as
  well.

- ISA v3.0B radix must flush partition and process scoped translations,
  partition and process table caches, and also the page walk cache.

So consolidate the flushing code and implement it in C and inline asm
under the mm/ directory with the rest of the flush code. Add ISA v3.0B
cases for radix and hash, and use the radix flush in radix environment.

Provide a way for IS=2 (LPID flush) to specify the radix mode of the
partition. Have KVM pass in the radix mode of the guest.

Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
and move it later in the boot process after, the MMU registers are set
up and before relocation is first turned on.

The TLB flush is no longer called when restoring from deep idle states.
This was not be done as a separate step because booting secondaries
uses the same cpu_restore as idle restore, which needs the TLB flush.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:40:31 +11:00
Nicholas Piggin 4552d128c2 powerpc: System reset avoid interleaving oops using die synchronisation
The die() oops path contains a serializing lock to prevent oops
messages from being interleaved. In the case of a system reset
initiated oops (e.g., qemu nmi command), __die was being called
which lacks that synchronisation and oops reports could be
interleaved across CPUs.

A recent patch 4388c9b3a6 ("powerpc: Do not send system reset
request through the oops path") changed this to __die to avoid
the debugger() call, but there is no real harm to calling it twice
if the first time fell through. So go back to using die() here.
This was observed to fix the problem.

Fixes: 4388c9b3a6 ("powerpc: Do not send system reset request through the oops path")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-18 00:38:59 +11:00
Michal Suchanek 1b689a95ce powerpc/pseries: include linux/types.h in asm/hvcall.h
Commit 6e032b350c ("powerpc/powernv: Check device-tree for RFI flush
settings") uses u64 in asm/hvcall.h without including linux/types.h

This breaks hvcall.h users that do not include the header themselves.

Fixes: 6e032b350c ("powerpc/powernv: Check device-tree for RFI flush settings")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17 23:30:46 +11:00
Michael Ellerman 236003e6b5 powerpc/64s: Allow control of RFI flush via debugfs
Expose the state of the RFI flush (enabled/disabled) via debugfs, and
allow it to be enabled/disabled at runtime.

eg: $ cat /sys/kernel/debug/powerpc/rfi_flush
    1
    $ echo 0 > /sys/kernel/debug/powerpc/rfi_flush
    $ cat /sys/kernel/debug/powerpc/rfi_flush
    0

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
2018-01-17 23:30:21 +11:00
Michael Ellerman fd6e440f20 powerpc/64s: Wire up cpu_show_meltdown()
The recent commit 87590ce6e3 ("sysfs/cpu: Add vulnerability folder")
added a generic folder and set of files for reporting information on
CPU vulnerabilities. One of those was for meltdown:

  /sys/devices/system/cpu/vulnerabilities/meltdown

This commit wires up that file for 64-bit Book3S powerpc.

For now we default to "Vulnerable" unless the RFI flush is enabled.
That may not actually be true on all hardware, further patches will
refine the reporting based on the CPU/platform etc. But for now we
default to being pessimists.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-17 23:30:20 +11:00
David S. Miller c02b3741eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Overlapping changes all over.

The mini-qdisc bits were a little bit tricky, however.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-17 00:10:42 -05:00
Paul Mackerras 6964e6a4e4 KVM: PPC: Book3S HV: Do SLB load/unload with guest LPCR value loaded
This moves the code that loads and unloads the guest SLB values so that
it is done while the guest LPCR value is loaded in the LPCR register.
The reason for doing this is that on POWER9, the behaviour of the
slbmte instruction depends on the LPCR[UPRT] bit.  If UPRT is 1, as
it is for a radix host (or guest), the SLB index is truncated to
2 bits.  This means that for a HPT guest on a radix host, the SLB
was not being loaded correctly, causing the guest to crash.

The SLB is now loaded much later in the guest entry path, after the
LPCR is loaded, which for a secondary thread is after it sees that
the primary thread has switched the MMU to the guest.  The loop that
waits for the primary thread has a branch out to the exit code that
is taken if it sees that other threads have commenced exiting the
guest.  Since we have now not loaded the SLB at this point, we make
this path branch to a new label 'guest_bypass' and we move the SLB
unload code to before this label.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-17 11:19:02 +11:00
Paul Mackerras 43ff3f6523 KVM: PPC: Book3S HV: Make sure we don't re-enter guest without XIVE loaded
This fixes a bug where it is possible to enter a guest on a POWER9
system without having the XIVE (interrupt controller) context loaded.
This can happen because we unload the XIVE context from the CPU
before doing the real-mode handling for machine checks.  After the
real-mode handler runs, it is possible that we re-enter the guest
via a fast path which does not load the XIVE context.

To fix this, we move the unloading of the XIVE context to come after
the real-mode machine check handler is called.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-17 11:18:48 +11:00
Benjamin Herrenschmidt 2271db20e4 powerpc: Use the TRAP macro whenever comparing a trap number
Trap numbers can have extra bits at the bottom that need to
be filtered out. There are a few cases where we don't do that.

It's possible that we got lucky but better safe than sorry.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:51:43 +11:00
Benjamin Herrenschmidt 872e2ae4bd powerpc: Remove useless EXC_COMMON_HV
The only difference between EXC_COMMON_HV and EXC_COMMON is that the
former adds "2" to the trap number which is supposed to represent the
fact that this is an "HV" interrupt which uses HSRR0/1.

However KVM is the only one who cares and it has its own separate macros.

In fact, we only have one user of EXC_COMMON_HV and it's for an
unknown interrupt case. All the other ones already using EXC_COMMON.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:51:42 +11:00
Benjamin Herrenschmidt f5abe14f88 powerpc/xive: Remove incorrect debug code
WORD2 if the TIMA isn't byte accessible and
isn't that useful to know about, take out the
pr_devel statement.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:48:04 +11:00
Benjamin Herrenschmidt fbadeb6bb1 powerpc: Cosmetic cleanup of cpuinfo_op
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:16 +11:00
Benjamin Herrenschmidt f5f563012a powerpc: Make newline in cpuinfo unconditional
We used to not put the newline between the CPU part and the summary
part on UP kernels. This is a rather pointless ifdef so take it out.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:15 +11:00
Benjamin Herrenschmidt bba9bc89b9 powerpc: Add aacraid and nvme to powernv_defconfig
These adapters can be found in a number of our systems, so let's
enable the corresponding drivers by default.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:15 +11:00
Christophe Leroy 4f94b2c746 powerpc/8xx: Use L1 entry APG to handle _PAGE_ACCESSED for CONFIG_SWAP
When CONFIG_SWAP is set, the TLB miss handlers have to also take
into account _PAGE_ACCESSED flag. At the moment it is done by
anding _PAGE_ACCESSED into _PAGE_PRESENT using 3 instructions.

This patch uses APG for handling _PAGE_ACCESSED, allowing to
just copy _PAGE_ACCESSED bit into APG field, hence reducing the
action to a single instruction.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:15 +11:00
Christophe Leroy de0f938739 powerpc/8xx: Remove _PAGE_USER and handle user access at PMD level
As Linux kernel separates KERNEL and USER address spaces, there is
therefore no need to flag USER access at page level.

Today, the 8xx TLB handlers already handle user access in the L1 entry
through Access Protection Groups, it is then natural to move the user
access handling at PMD level once _PAGE_NA allows to handle PAGE_NONE
protection without _PAGE_USER

In the mean time, as we free up one bit in the PTE, we can use it to
include SPS (page size flag) in the PTE and avoid handling it at every
TLB miss hence removing special handling based on compiled page size.

For _PAGE_EXEC, we rework it to use PP PTE bits, avoiding the copy
of _PAGE_EXEC bit into the L1 entry. Unfortunatly we are not
able to put it at the correct location as it conflicts with
NA/RO/RW bits for data entries.

Upper bits of APG in L1 entry overlap with PMD base address. In
order to avoid having to filter that out, we set up all groups so that
upper bits can have any value.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:14 +11:00
Christophe Leroy 351750331f powerpc/mm: Introduce _PAGE_NA
Today, PAGE_NONE is defined as a page not having _PAGE_USER.
In some circunstances, when the CPU supports it, it might be
better to be able to flag a page with NO ACCESS.

In a following patch, the 8xx will switch user access being flagged
in the PMD, therefore it will not be possible anymore to use
_PAGE_USER as a way to flag a page with no access.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:14 +11:00
Christophe Leroy 812fadcb94 powerpc/mm: extend _PAGE_PRIVILEGED to all CPUs
commit ac29c64089 ("powerpc/mm: Replace _PAGE_USER with
_PAGE_PRIVILEGED") introduced _PAGE_PRIVILEGED for BOOK3S/64

This patch generalises _PAGE_PRIVILEGED for all CPUs, allowing
to have either _PAGE_PRIVILEGED or _PAGE_USER or both.

PPC_8xx has a _PAGE_SHARED flag which is set for and only for
all non user pages. Lets rename it _PAGE_PRIVILEGED to remove
confusion as it has nothing to do with Linux shared pages.

On BookE, there's a _PAGE_BAP_SR which has to be set for kernel
pages: defining _PAGE_PRIVILEGED as _PAGE_BAP_SR will make
this generic

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:13 +11:00
Christophe Leroy 5f356497c3 powerpc/8xx: remove unused _PAGE_WRITETHRU
_PAGE_WRITETHRU is only used in:
* AMIGA_Z2RAM block driver which is never activated on powerPC
* Video/FB driver which is for PPC_PMAC

Therefore, no need to spend time in 8xx TLB miss handlers for
handling it.

And by removing it, we free up bit 20 which then avoids having
to clear it on each TLB miss.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:13 +11:00
Christophe Leroy cd99ddbea2 powerpc/8xx: Only perform perf counting when perf is in use.
In TLB miss handlers, updating the perf counter is only useful
when performing a perf analysis. As it has a noticeable overhead,
let's only do it when needed.

In order to do so, the exit of the miss handlers will be patched
when starting/stopping 'perf': the first register restore
instruction of each exit point will be replaced by a jump to
the counting code.

Once this is done, CONFIG_PPC_8xx_PERF_EVENT becomes useless as
this feature doesn't add any overhead.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:12 +11:00
Christophe Leroy bb9b5a8332 powerpc/8xx: remove EXCEPTION_PROLOG/EPILOG_0 and change r3 to r12
EXCEPTION_PROLOG_0 and EXCEPTION_EPILOG_0 were added some
time ago in order to regroup the two mtspr/mfspr to SCRATCH0 and
SCRATCH1 and the mfcr/mtcr in order to ease entry and exit of
function not using the full EXCEPTION_PROLOG.

Since then, the mfcr/mtcr has been taken out, hence just leaving
the two mtspr/mfspr in the macro.

In order to improve readability of the exception functions, we
remove those two macros and copy back the two mtspr/mfspr instead.

As r10 and r11 are used for SCRATCH0 and SCRATCH1, lets also use
r12 for SCRATCH2. It will also improve the readability/maintenance.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:12 +11:00
Christophe Leroy 2a45addd21 powerpc/8xx: Remove CPU6 ERRATA Workaround
CPU6 ERRATA affects only MPC860 revisions prior to C.0. Manufacturing
of those revisiosn was stopped in 1999-2000.
Therefore, it has been almost 20 years since this ERRATA has been
fixed in the silicon.

This patch removes the workaround for that ERRATA.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:12 +11:00
Christophe Leroy 9a3b849bfe powerpc/8xx: do not select CONFIG_PPC_LIB_RHEAP
Since commit 0e6e01ff69 ("CPM/QE: use genalloc to manage CPM/QE
muram"), rheap is not used anymore.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:11 +11:00
Balbir Singh 4145f35864 powernv/kdump: Fix cases where the kdump kernel can get HMI's
Certain HMI's such as malfunction error propagate through
all threads/core on the system. If a thread was offline
prior to us crashing the system and jumping to the kdump
kernel, bad things happen when it wakes up due to an HMI
in the kdump kernel.

There are several possible ways to solve this problem

1. Put the offline cores in a state such that they are
not woken up for machine check and HMI errors. This
does not work, since we might need to wake up offline
threads to handle TB errors
2. Ignore HMI errors, setup HMEER to mask HMI errors,
but this still leads the window open for any MCEs
and masking them for the duration of the dump might
be a concern
3. Wake up offline CPUs, as in send them to
crash_ipi_callback (not wake them up as in mark them
online as seen by the hotplug). kexec does a
wake_online_cpus() call, this patch does something
similar, but instead sends an IPI and forces them to
crash_ipi_callback()

This patch takes approach #3.

Care is taken to enable this only for powenv platforms
via crash_wake_offline (a global value set at setup
time). The crash code sends out IPI's to all CPU's
which then move to crash_ipi_callback and kexec_smp_wait().

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:11 +11:00
Balbir Singh 04b9c96eae powerpc/crash: Remove the test for cpu_online in the IPI callback
Our check was extra cautious, we've audited crash_send_ipi
and it sends an IPI only to online CPU's. Removal of this
check should have not functional impact on crash kdump.

Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:10 +11:00
Dmitry Torokhov 9625e69a38 powerpc: make use of for_each_node_by_type() instead of open-coding it
Instead of manually coding the loop with of_find_node_by_type(), let's
switch to the standard macro for iterating over nodes with given type.

Also fixed a couple of refcount leaks in the aforementioned loops.

Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:47:10 +11:00
Christophe Leroy 94d3084a0f powerpc/32s: Fix compile error with CONFIG_PPC_PTDUMP
This patch remove CONFIG_PPC_HTDUMP if not PPC_BOOK3S_64 to avoid
below compile failure on BOOK3S_32:

  In file included from arch/powerpc/mm/dump_hashpagetable.c:27:0:
  ./arch/powerpc/include/asm/plpar_wrappers.h: In function 'get_cede_latency_hint':
  ./arch/powerpc/include/asm/plpar_wrappers.h:27:2: error: implicit declaration of function 'get_lppaca' [-Werror=implicit-function-declaration]
  ...
  arch/powerpc/mm/dump_hashpagetable.c: At top level:
  arch/powerpc/mm/dump_hashpagetable.c:69:13: error: 'SLB_VSID_B' undeclared here (not in a function)
  ...
  arch/powerpc/mm/dump_hashpagetable.c:506:38: error: 'VMEMMAP_BASE' undeclared (first use in this function)
  arch/powerpc/mm/dump_hashpagetable.c:506:35: error: assignment makes integer from pointer without a cast [-Werror]

Fixes: dd5ac03e09 ("powerpc/mm: Fix page table dump build on non-Book3S")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Trim change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:35:02 +11:00
Nathan Fontenot 0c38ed6f6f powerpc/pseries: Enable support of ibm,dynamic-memory-v2
Add required bits to the architecture vector to enable support
of the ibm,dynamic-memory-v2 device tree property.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:30 +11:00
Nathan Fontenot 2b31e3aec1 powerpc/drmem: Add support for ibm, dynamic-memory-v2 property
The Power Hypervisor has introduced a new device tree format for
the property describing the dynamic reconfiguration LMBs for a system,
ibm,dynamic-memory-v2. This new format condenses the size of the
property, especially on large memory systems, by reporting sets
of LMBs that have the same properties (flags and associativity array
index).

This patch updates the powerpc/mm/drmem.c code to provide routines
that can parse the new device tree format during the walk_drmem_lmb*
routines used during boot, the creation of the LMB array, and updating
the device tree to create a new property in the proper format for
ibm,dynamic-memory-v2.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:29 +11:00
Nathan Fontenot 2c77721552 powerpc: Move of_drconf_cell struct to asm/drmem.h
Now that the powerpc code parses dynamic reconfiguration memory
LMB information from the LMB array and not the device tree
directly we can move the of_drconf_cell struct to drmem.h where
it fits better.

In addition, the struct is renamed to of_drconf_cell_v1 in
anticipation of upcoming support for version 2 of the dynamic
reconfiguration property and the members are typed as __be*
values to reflect how they exist in the device tree.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:29 +11:00
Nathan Fontenot 6195a5001f powerpc/pseries: Update memory hotplug code to use drmem LMB array
Update the pseries memory hotplug code to use the newly added
dynamic reconfiguration LMB array. Doing this is required for the
upcoming support of version 2 of the dynamic reconfiguration
device tree property.

In addition, making this change cleans up the code that parses the
LMB information as we no longer need to worry about device tree
format. This allows us to discard one of the first steps on memory
hotplug where we make a working copy of the device tree property and
convert the entire property to cpu format. Instead we just use the
LMB array directly while holding the memory hotplug lock.

This patch also moves the updating of the device tree property to
powerpc/mm/drmem.c. This allows to the hotplug code to work without
needing to know the device tree format and provides a single
routine for updating the device tree property. This new routine
will handle determination of the proper device tree format and
generate a properly formatted device tree property.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:28 +11:00
Nathan Fontenot 514a9cb331 powerpc/numa: Update numa code use walk_drmem_lmbs
Update code in powerpc/numa.c to use the walk_drmem_lmbs()
routine instead of parsing the device tree directly. This is
in anticipation of introducing a new ibm,dynamic-memory-v2
property with a different format. This will allow the numa code
to use a single initialization routine per-LMB irregardless of
the device tree format.

Additionally, to support additional routines in numa.c that need
to look up LMB information, an late_init routine is added to drmem.c
to allocate the array of LMB information. This LMB array will provide
per-LMB information to separate the LMB data from the device tree
format.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:28 +11:00
Nathan Fontenot 6c6ea53725 powerpc/mm: Separate ibm, dynamic-memory data from DT format
We currently have code to parse the dynamic reconfiguration LMB
information from the ibm,dynamic-meory device tree property in
multiple locations; numa.c, prom.c, and pseries/hotplug-memory.c.
In anticipation of adding support for a version 2 of the
ibm,dynamic-memory property this patch aims to separate the device
tree information from the device tree format.

Doing this requires a two step process to avoid a possibly very large
bootmem allocation early in boot. During initial boot, new routines
are provided to walk the device tree property and make a call-back
for each LMB.

The second step (introduced in later patches) will allocate an
array of LMB information that can be used directly without needing
to know the DT format.

This approach provides the benefit of consolidating the device tree
property parsing to a single location and (eventually) providing
a common data structure for retrieving LMB information.

This patch introduces a routine to walk the ibm,dynamic-memory
property in the flattened device tree and updates the prom.c code
to use this to initialize memory.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 23:26:27 +11:00
Nathan Fontenot b88fc309d6 powerpc/numa: Look up associativity array in of_drconf_to_nid_single
Look up the associativity arrays in of_drconf_to_nid_single when
deriving the nid for a LMB instead of having it passed in as a
parameter.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 18:23:48 +11:00
Nathan Fontenot 22508f3dc9 powerpc/numa: Look up device node in of_get_usable_memory()
Look up the device node for the usable memory property instead
of having it passed in as a parameter. This changes precedes an update
in which the calling routines for of_get_usable_memory() will not have
the device node pointer to pass in.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 18:23:48 +11:00
Nathan Fontenot 35f80debae powerpc/numa: Look up device node in of_get_assoc_arrays()
Look up the device node for the associativity array property instead
of having it passed in as a parameter. This changes precedes an update
in which the calling routines for of_get_assoc_arrays() will not have
the device node pointer to pass in.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-16 18:23:47 +11:00
Eric W. Biederman ea64d5acc8 signal: Unify and correct copy_siginfo_to_user32
Among the existing architecture specific versions of
copy_siginfo_to_user32 there are several different implementation
problems.  Some architectures fail to handle all of the cases in in
the siginfo union.  Some architectures perform a blind copy of the
siginfo union when the si_code is negative.  A blind copy suggests the
data is expected to be in 32bit siginfo format, which means that
receiving such a signal via signalfd won't work, or that the data is
in 64bit siginfo and the code is copying nonsense to userspace.

Create a single instance of copy_siginfo_to_user32 that all of the
architectures can share, and teach it to handle all of the cases in
the siginfo union correctly, with the assumption that siginfo is
stored internally to the kernel is 64bit siginfo format.

A special case is made for x86 x32 format.  This is needed as presence
of both x32 and ia32 on x86_64 results in two different 32bit signal
formats.  By allowing this small special case there winds up being
exactly one code base that needs to be maintained between all of the
architectures.  Vastly increasing the testing base and the chances of
finding bugs.

As the x86 copy of copy_siginfo_to_user32 the call of the x86
signal_compat_build_tests were moved into sigaction_compat_abi, so
that they will keep running.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 19:56:20 -06:00
Paul Mackerras 5855564c8a KVM: PPC: Book3S HV: Enable migration of decrementer register
This adds a register identifier for use with the one_reg interface
to allow the decrementer expiry time to be read and written by
userspace.  The decrementer expiry time is in guest timebase units
and is equal to the sum of the decrementer and the guest timebase.
(The expiry time is used rather than the decrementer value itself
because the expiry time is not constantly changing, though the
decrementer value is, while the guest vcpu is not running.)

Without this, a guest vcpu migrated to a new host will see its
decrementer set to some random value.  On POWER8 and earlier, the
decrementer is 32 bits wide and counts down at 512MHz, so the
guest vcpu will potentially see no decrementer interrupts for up
to about 4 seconds, which will lead to a stall.  With POWER9, the
decrementer is now 56 bits side, so the stall can be much longer
(up to 2.23 years) and more noticeable.

To help work around the problem in cases where userspace has not been
updated to migrate the decrementer expiry time, we now set the
default decrementer expiry at vcpu creation time to the current time
rather than the maximum possible value.  This should mean an
immediate decrementer interrupt when a migrated vcpu starts
running.  In cases where the decrementer is 32 bits wide and more
than 4 seconds elapse between the creation of the vcpu and when it
first runs, the decrementer would have wrapped around to positive
values and there may still be a stall - but this is no worse than
the current situation.  In the large-decrementer case, we are sure
to get an immediate decrementer interrupt (assuming the time from
vcpu creation to first run is less than 2.23 years) and we thus
avoid a very long stall.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-16 11:54:45 +11:00
Eric W. Biederman 212a36a17e signal: Unify and correct copy_siginfo_from_user32
The function copy_siginfo_from_user32 is used for two things, in ptrace
since the dawn of siginfo for arbirarily modifying a signal that
user space sees, and in sigqueueinfo to send a signal with arbirary
siginfo data.

Create a single copy of copy_siginfo_from_user32 that all architectures
share, and teach it to handle all of the cases in the siginfo union.

In the generic version of copy_siginfo_from_user32 ensure that all
of the fields in siginfo are initialized so that the siginfo structure
can be safely copied to userspace if necessary.

When copying the embedded sigval union copy the si_int member.  That
ensures the 32bit values passes through the kernel unchanged.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:55:59 -06:00
Eric W. Biederman ad2b1ab57d signal/powerpc: Remove redefinition of NSIGTRAP on powerpc
NSIGTRAP is 4 in the generic siginfo and powerpc just undefines
NSGTRAP and redefine it as 4.  That accomplishes nothing so remove
the duplication.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-15 17:42:33 -06:00
Al Viro b713da69e4 signal: unify compat_siginfo_t
--EWB Added #ifdef CONFIG_X86_X32_ABI to arch/x86/kernel/signal_compat.c
      Changed #ifdef CONFIG_X86_X32 to #ifdef CONFIG_X86_X32_ABI in
      linux/compat.h

      CONFIG_X86_X32 is set when the user requests X32 support.

      CONFIG_X86_X32_ABI is set when the user requests X32 support
      and the tool-chain has X32 allowing X32 support to be built.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-01-15 17:40:31 -06:00
Christoph Hellwig 7f2c8bbd32 swiotlb: rename swiotlb_free to swiotlb_exit
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-01-15 09:35:39 +01:00
Christoph Hellwig a37a3710f3 powerpc: rename swiotlb_dma_ops
We'll need that name for a generic implementation soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2018-01-15 09:35:26 +01:00
Christoph Hellwig 57bf5a8963 dma-mapping: clear harmful GFP_* flags in common code
Lift the code from x86 so that we behave consistently.  In the future we
should probably warn if any of these is set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
2018-01-15 09:34:55 +01:00
Linus Torvalds 6bb821193b powerpc fixes for 4.15 #7
One fix for an oops at boot if we take a hotplug interrupt before we are ready
 to handle it.
 
 The bulk is patches to implement mitigation for Meltdown, see the change logs
 for more details.
 
 Thanks to:
   Nicholas Piggin, Michael Neuling, Oliver O'Halloran, Jon Masters, Jose Ricardo
   Ziviani, David Gibson.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJaWXZ+ExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBe
 ew/+KMp80VAIGSyFeYHLg8oClnQjKcEDMzlNoHo8WU6NwwNzk2XXBXNaGT7Osg7B
 Ny69R3/VRrGmi/2uule9OzF2eZt8BUmRn3cHDcUB5OwFjMysv4M2jc+OK0STQxLU
 F769rX+ZlkyAvQn4UxtNCUob+JDfVbE5Lurrx475ZGL1YHk6QqBGhlN/niXKtkBz
 upeBEpu4JgwsLLD9H7c4QQha9PkhxY51jtvxL24cgwDuKJKhTdxLahPlA3QmLafu
 FJxR9R0YzTj9Ivuae5yM7xRKr5wROTPtTRbG9sN7XOgFNrQ+LSgemXGrlV/fzdhV
 6iPn2hHs4bu45SOIM+Hhf7OwHTtnMetXmKo6NldlKHVCw1HwBLigRohnOTPjiHfZ
 P6brR4T/cglakBiE29+8T0UFqOPizEJiQRHnsoWzwIa/aTqCGEE3m3rgTxU6ovt5
 3JggC51ZRDFsjGvj+JFnxGxCjYC4tDbfBI28M+GuWZZPBKLkFnEk8PlSzYyHPW2p
 5uVA/UN4KHVGYUfJVUC+LMP+ZIOSEyEISKk7RuK0C0mwHSXcvxbeEefXbc18Xjai
 J/pl/c9/4CRnEbdPm305xBxKKmU6gWxWRK3KbRFsoTI/wp+Ryx5TjPpgAa5GnaLv
 nrtrfKs8skWfGMoDWH7+KILt0fgRR2x+91GTSxc5aaHtYBI=
 =UZHv
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "One fix for an oops at boot if we take a hotplug interrupt before we
  are ready to handle it.

  The bulk is patches to implement mitigation for Meltdown, see the
  change logs for more details.

  Thanks to: Nicholas Piggin, Michael Neuling, Oliver O'Halloran, Jon
  Masters, Jose Ricardo Ziviani, David Gibson"

* tag 'powerpc-4.15-7' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powernv: Check device-tree for RFI flush settings
  powerpc/pseries: Query hypervisor for RFI flush settings
  powerpc/64s: Support disabling RFI flush with no_rfi_flush and nopti
  powerpc/64s: Add support for RFI flush of L1-D cache
  powerpc/64s: Convert slb_miss_common to use RFI_TO_USER/KERNEL
  powerpc/64: Convert fast_exception_return to use RFI_TO_USER/KERNEL
  powerpc/64: Convert the syscall exit path to use RFI_TO_USER/KERNEL
  powerpc/64s: Simple RFI macro conversions
  powerpc/64: Add macros for annotating the destination of rfid/hrfid
  powerpc/pseries: Add H_GET_CPU_CHARACTERISTICS flags & wrapper
  powerpc/pseries: Make RAS IRQ explicitly dependent on DLPAR WQ
2018-01-14 15:03:17 -08:00
Eric W. Biederman 2f82a46f66 signal: Remove _sys_private and _overrun_incr from struct compat_siginfo
We have never passed either field to or from userspace so just remove them.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:34:46 -06:00
Eric W. Biederman cf4674c46c signal/powerpc: Document conflicts with SI_USER and SIGFPE and SIGTRAP
Setting si_code to 0 results in a userspace seeing an si_code of 0.
This is the same si_code as SI_USER.  Posix and common sense requires
that SI_USER not be a signal specific si_code.  As such this use of 0
for the si_code is a pretty horribly broken ABI.

Further use of si_code == 0 guaranteed that copy_siginfo_to_user saw a
value of __SI_KILL and now sees a value of SIL_KILL with the result
that uid and pid fields are copied and which might copying the si_addr
field by accident but certainly not by design.  Making this a very
flakey implementation.

Utilizing FPE_FIXME and TRAP_FIXME, siginfo_layout() will now return
SIL_FAULT and the appropriate fields will be reliably copied.

Possible ABI fixes includee:
- Send the signal without siginfo
- Don't generate a signal
- Possibly assign and use an appropriate si_code
- Don't handle cases which can't happen
Cc: Paul Mackerras <paulus@samba.org>
Cc: Kumar Gala <kumar.gala@freescale.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc:  linuxppc-dev@lists.ozlabs.org
Ref: 9bad068c24d7 ("[PATCH] ppc32: support for e500 and 85xx")
Ref: 0ed70f6105ef ("PPC32: Provide proper siginfo information on various exceptions.")
History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:21:04 -06:00
Eric Biggers a208fa8f33 crypto: hash - annotate algorithms taking optional key
We need to consistently enforce that keyed hashes cannot be used without
setting the key.  To do this we need a reliable way to determine whether
a given hash algorithm is keyed or not.  AF_ALG currently does this by
checking for the presence of a ->setkey() method.  However, this is
actually slightly broken because the CRC-32 algorithms implement
->setkey() but can also be used without a key.  (The CRC-32 "key" is not
actually a cryptographic key but rather represents the initial state.
If not overridden, then a default initial state is used.)

Prepare to fix this by introducing a flag CRYPTO_ALG_OPTIONAL_KEY which
indicates that the algorithm has a ->setkey() method, but it is not
required to be called.  Then set it on all the CRC-32 algorithms.

The same also applies to the Adler-32 implementation in Lustre.

Also, the cryptd and mcryptd templates have to pass through the flag
from their underlying algorithm.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-01-12 23:03:35 +11:00
Benjamin Herrenschmidt 7f1c410da5 powerpc/xive: Add interrupt flag to disable automatic EOI
This will be used by KVM in order to keep escalation interrupts
in the non-EOI (masked) state after they fire. They will be
re-enabled directly in HW by KVM when needed.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-12 15:24:44 +11:00
Benjamin Herrenschmidt 12c1f339cd powerpc/xive: Move definition of ESB bits
From xive.h to xive-regs.h since it's a HW register definition
and it can be used from assembly

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-12 15:24:41 +11:00
Sinan Kaya 50ed5780d5 powerpc/PCI: Deprecate pci_get_bus_and_slot()
pci_get_bus_and_slot() is restrictive such that it assumes domain=0 as
where a PCI device is present. This restricts the device drivers to be
reused for other domain numbers.

Getting ready to remove pci_get_bus_and_slot() function in favor of
pci_get_domain_bus_and_slot().

Use pci_get_domain_bus_and_slot() with a domain number of 0 as the code
is not ready to consume multiple domains and existing code used domain
number 0.

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-11 17:21:55 -06:00
Paolo Bonzini 0217690f88 PPC KVM fixes for 4.15
Four commits here, including two that were tagged but never merged.
 Three of them are for the HPT resizing code; two of those fix a
 user-triggerable use-after-free in the host, and one that fixes
 stale TLB entries in the guest.  The remaining commit fixes a bug
 causing PR KVM guests under PowerVM to fail to start.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaVfPgAAoJEJ2a6ncsY3GfA10IANZMkwtIpqxGlsAeXKr5bWdl
 iXYD9ymb2/FOHBbg6v8Eh6Gb1ycjzXpXqn74/Y9TE4Ort7mdiH+W6kXYEsMqL8yg
 7Uwnj8DuWFuFxX0x0V4SJQzgdCnOefVcfoo/RnLUzmLsW0Vqtr3A1djM5iHlxFvv
 ntkNtGYPOoaHl6rjtfHTDfLWN/DzEJbaIU/0O1LIkBxPG4STzSXErAucLL46Pa/X
 NuPO2HfpxQiacHVG62iy89eJeAcraEAXnH5e6eVPRQQqh3DSIERMU6n6jXyZeMU5
 NWX8Qme3VGBpiJOiCGMvMrnJmQmMTSWTtkGljyaFy+vZWMqGZ6xJ3wIP+5t9d+Q=
 =dw6K
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-4.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master

PPC KVM fixes for 4.15

Four commits here, including two that were tagged but never merged.
Three of them are for the HPT resizing code; two of those fix a
user-triggerable use-after-free in the host, and one that fixes
stale TLB entries in the guest.  The remaining commit fixes a bug
causing PR KVM guests under PowerVM to fail to start.
2018-01-11 14:07:27 +01:00
Markus Elfring 1627301020 KVM: PPC: Use seq_puts() in kvmppc_exit_timing_show()
A headline should be quickly put into a sequence. Thus use the
function "seq_puts" instead of "seq_printf" for this purpose.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-11 20:36:06 +11:00
Alexander Graf 81ceca05a4 KVM: PPC: Book3S HV: Remove vcpu->arch.dec usage
On Book3S in HV mode, we don't use the vcpu->arch.dec field at all.
Instead, all logic is built around vcpu->arch.dec_expires.

So let's remove the one remaining piece of code that was setting it.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-11 20:36:06 +11:00
Bjorn Helgaas fdabc3fe99 PCI: Add #defines for Completion Timeout Disable feature
Add #defines for the Completion Timeout Disable feature and use them.  No
functional change intended.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 16:12:18 -06:00
Christoph Hellwig 2d9d6f6c9e powerpc: rename dma_direct_ to dma_nommu_
We want to use the dma_direct_ namespace for a generic implementation,
so rename powerpc to the second best choice: dma_nommu_.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-10 16:41:14 +01:00
Christoph Hellwig b49efd7624 dma-mapping: move dma_mark_clean to dma-direct.h
And unlike the other helpers we don't require a <asm/dma-direct.h> as
this helper is a special case for ia64 only, and this keeps it as
simple as possible.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-10 16:41:12 +01:00
Christoph Hellwig ea8c64ace8 dma-mapping: move swiotlb arch helpers to a new header
phys_to_dma, dma_to_phys and dma_capable are helpers published by
architecture code for use of swiotlb and xen-swiotlb only.  Drivers are
not supposed to use these directly, but use the DMA API instead.

Move these to a new asm/dma-direct.h helper, included by a
linux/dma-direct.h wrapper that provides the default linear mapping
unless the architecture wants to override it.

In the MIPS case the existing dma-coherent.h is reused for now as
untangling it will take a bit of work.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Robin Murphy <robin.murphy@arm.com>
2018-01-10 16:40:54 +01:00
Benjamin Herrenschmidt 349524bc0d powerpc: Don't preempt_disable() in show_cpuinfo()
This causes warnings from cpufreq mutex code. This is also rather
unnecessary and ineffective. If we really want to prevent concurrent
unplug, we could take the unplug read lock but I don't see this being
critical.

Fixes: cd77b5ce20 ("powerpc/powernv/cpufreq: Fix the frequency read by /proc/cpuinfo")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-11 01:36:50 +11:00
Michael Ellerman 2248fade96 powerpc/xmon: Don't print hashed pointers in paca dump
Remember when the biggest problem we had to worry about was hashed
pointers, those were the days.

These were missed in my earlier patch because they don't match "%p",
but the macro is hiding a "%p", so these all end up being hashed,
which is not what we want in xmon. Convert them to "%px".

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-11 01:17:24 +11:00
Michael Ellerman 274920a3ec powerpc/xmon: Add RFI flush related fields to paca dump
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-11 00:10:50 +11:00
Oliver O'Halloran 6e032b350c powerpc/powernv: Check device-tree for RFI flush settings
New device-tree properties are available which tell the hypervisor
settings related to the RFI flush. Use them to determine the
appropriate flush instruction to use, and whether the flush is
required.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 21:27:16 +11:00
Michael Neuling 8989d56878 powerpc/pseries: Query hypervisor for RFI flush settings
A new hypervisor call is available which tells the guest settings
related to the RFI flush. Use it to query the appropriate flush
instruction(s), and whether the flush is required.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 21:27:15 +11:00
Michael Ellerman bc9c9304a4 powerpc/64s: Support disabling RFI flush with no_rfi_flush and nopti
Because there may be some performance overhead of the RFI flush, add
kernel command line options to disable it.

We add a sensibly named 'no_rfi_flush' option, but we also hijack the
x86 option 'nopti'. The RFI flush is not the same as KPTI, but if we
see 'nopti' we can guess that the user is trying to avoid any overhead
of Meltdown mitigations, and it means we don't have to educate every
one about a different command line option.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 21:27:15 +11:00
Michael Ellerman aa8a5e0062 powerpc/64s: Add support for RFI flush of L1-D cache
On some CPUs we can prevent the Meltdown vulnerability by flushing the
L1-D cache on exit from kernel to user mode, and from hypervisor to
guest.

This is known to be the case on at least Power7, Power8 and Power9. At
this time we do not know the status of the vulnerability on other CPUs
such as the 970 (Apple G5), pasemi CPUs (AmigaOne X1000) or Freescale
CPUs. As more information comes to light we can enable this, or other
mechanisms on those CPUs.

The vulnerability occurs when the load of an architecturally
inaccessible memory region (eg. userspace load of kernel memory) is
speculatively executed to the point where its result can influence the
address of a subsequent speculatively executed load.

In order for that to happen, the first load must hit in the L1,
because before the load is sent to the L2 the permission check is
performed. Therefore if no kernel addresses hit in the L1 the
vulnerability can not occur. We can ensure that is the case by
flushing the L1 whenever we return to userspace. Similarly for
hypervisor vs guest.

In order to flush the L1-D cache on exit, we add a section of nops at
each (h)rfi location that returns to a lower privileged context, and
patch that with some sequence. Newer firmwares are able to advertise
to us that there is a special nop instruction that flushes the L1-D.
If we do not see that advertised, we fall back to doing a displacement
flush in software.

For guest kernels we support migration between some CPU versions, and
different CPUs may use different flush instructions. So that we are
prepared to migrate to a machine with a different flush instruction
activated, we may have to patch more than one flush instruction at
boot if the hypervisor tells us to.

In the end this patch is mostly the work of Nicholas Piggin and
Michael Ellerman. However a cast of thousands contributed to analysis
of the issue, earlier versions of the patch, back ports testing etc.
Many thanks to all of them.

Tested-by: Jon Masters <jcm@redhat.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 21:27:06 +11:00
David Gibson ecba8297aa KVM: PPC: Book3S HV: Always flush TLB in kvmppc_alloc_reset_hpt()
The KVM_PPC_ALLOCATE_HTAB ioctl(), implemented by kvmppc_alloc_reset_hpt()
is supposed to completely clear and reset a guest's Hashed Page Table (HPT)
allocating or re-allocating it if necessary.

In the case where an HPT of the right size already exists and it just
zeroes it, it forces a TLB flush on all guest CPUs, to remove any stale TLB
entries loaded from the old HPT.

However, that situation can arise when the HPT is resizing as well - or
even when switching from an RPT to HPT - so those cases need a TLB flush as
well.

So, move the TLB flush to trigger in all cases except for errors.

Cc: stable@vger.kernel.org # v4.10+
Fixes: f98a8bf9ee ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-10 20:45:41 +11:00
Alexey Kardashevskiy 6c7d47c33e KVM: PPC: Book3S PR: Fix WIMG handling under pHyp
Commit 96df226 ("KVM: PPC: Book3S PR: Preserve storage control bits")
added code to preserve WIMG bits but it missed 2 special cases:
- a magic page in kvmppc_mmu_book3s_64_xlate() and
- guest real mode in kvmppc_handle_pagefault().

For these ptes, WIMG was 0 and pHyp failed on these causing a guest to
stop in the very beginning at NIP=0x100 (due to bd9166ffe "KVM: PPC:
Book3S PR: Exit KVM on failed mapping").

According to LoPAPR v1.1 14.5.4.1.2 H_ENTER:

 The hypervisor checks that the WIMG bits within the PTE are appropriate
 for the physical page number else H_Parameter return. (For System Memory
 pages WIMG=0010, or, 1110 if the SAO option is enabled, and for IO pages
 WIMG=01**.)

This hence initializes WIMG to non-zero value HPTE_R_M (0x10), as expected
by pHyp.

[paulus@ozlabs.org - fix compile for 32-bit]

Cc: stable@vger.kernel.org # v4.11+
Fixes: 96df226 "KVM: PPC: Book3S PR: Preserve storage control bits"
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Ruediger Oertel <ro@suse.de>
Reviewed-by: Greg Kurz <groug@kaod.org>
Tested-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2018-01-10 20:45:00 +11:00
David Howells 0500871f21 Construct init thread stack in the linker script rather than by union
Construct the init thread stack in the linker script rather than doing it
by means of a union so that ia64's init_task.c can be got rid of.

The following symbols are then made available from INIT_TASK_DATA() linker
script macro:

	init_thread_union
	init_stack

INIT_TASK_DATA() also expands the region to THREAD_SIZE to accommodate the
size of the init stack.  init_thread_union is given its own section so that
it can be placed into the stack space in the right order.  I'm assuming
that the ia64 ordering is correct and that the task_struct is first and the
thread_info second.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Will Deacon <will.deacon@arm.com> (arm64)
Tested-by: Palmer Dabbelt <palmer@sifive.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2018-01-09 23:21:02 +00:00
Nicholas Piggin c7305645eb powerpc/64s: Convert slb_miss_common to use RFI_TO_USER/KERNEL
In the SLB miss handler we may be returning to user or kernel. We need
to add a check early on and save the result in the cr4 register, and
then we bifurcate the return path based on that.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 03:07:33 +11:00
Nicholas Piggin a08f828cf4 powerpc/64: Convert fast_exception_return to use RFI_TO_USER/KERNEL
Similar to the syscall return path, in fast_exception_return we may be
returning to user or kernel context. We already have a test for that,
because we conditionally restore r13. So use that existing test and
branch, and bifurcate the return based on that.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 03:07:32 +11:00
Nicholas Piggin b8e90cb7bc powerpc/64: Convert the syscall exit path to use RFI_TO_USER/KERNEL
In the syscall exit path we may be returning to user or kernel
context. We already have a test for that, because we conditionally
restore r13. So use that existing test and branch, and bifurcate the
return based on that.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 03:07:31 +11:00
Nicholas Piggin 222f20f140 powerpc/64s: Simple RFI macro conversions
This commit does simple conversions of rfi/rfid to the new macros that
include the expected destination context. By simple we mean cases
where there is a single well known destination context, and it's
simply a matter of substituting the instruction for the appropriate
macro.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 03:07:30 +11:00
Nicholas Piggin 50e51c13b3 powerpc/64: Add macros for annotating the destination of rfid/hrfid
The rfid/hrfid ((Hypervisor) Return From Interrupt) instruction is
used for switching from the kernel to userspace, and from the
hypervisor to the guest kernel. However it can and is also used for
other transitions, eg. from real mode kernel code to virtual mode
kernel code, and it's not always clear from the code what the
destination context is.

To make it clearer when reading the code, add macros which encode the
expected destination context.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 03:07:30 +11:00
David S. Miller a0ce093180 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-01-09 10:37:00 -05:00
Christoph Hellwig c91a7a405b powerpc: remove unused flush_write_buffers definition
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-09 16:28:36 +01:00
Michael Ellerman a6978f405d Merge branch 'topic/ppc-kvm' into fixes
Merge the topic branch with share with the kvm-ppc tree. In this case
we need to share the definition of a new hypervisor call and
associated flags.
2018-01-10 02:24:34 +11:00
Michael Neuling 191eccb158 powerpc/pseries: Add H_GET_CPU_CHARACTERISTICS flags & wrapper
A new hypervisor call has been defined to communicate various
characteristics of the CPU to guests. Add definitions for the hcall
number, flags and a wrapper function.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-10 01:46:34 +11:00
Sergey Senozhatsky 5633e85b2c powerpc64: Add .opd based function descriptor dereference
We are moving towards separate kernel and module function descriptor
dereference callbacks. This patch enables it for powerpc64.

For pointers that belong to the kernel
-  Added __start_opd and __end_opd pointers, to track the kernel
   .opd section address range;

-  Added dereference_kernel_function_descriptor(). Now we
   will dereference only function pointers that are within
   [__start_opd, __end_opd);

For pointers that belong to a module
-  Added dereference_module_function_descriptor() to handle module
   function descriptor dereference. Now we will dereference only
   pointers that are within [module->opd.start, module->opd.end).

Link: http://lkml.kernel.org/r/20171109234830.5067-4-sergey.senozhatsky@gmail.com
To: Tony Luck <tony.luck@intel.com>
To: Fenghua Yu <fenghua.yu@intel.com>
To: Helge Deller <deller@gmx.de>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: Paul Mackerras <paulus@samba.org>
To: Michael Ellerman <mpe@ellerman.id.au>
To: James Bottomley <jejb@parisc-linux.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Santosh Sivaraj <santosh@fossix.org> #powerpc
Signed-off-by: Petr Mladek <pmladek@suse.com>
2018-01-09 10:45:37 +01:00
Christoph Hellwig a8fc357b28 mm: split altmap memory map allocation from normal case
No functional changes, just untangling the call chain and document
why the altmap is passed around the hotplug code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig 24b6d41643 mm: pass the vmem_altmap to vmemmap_free
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig da024512a1 mm: pass the vmem_altmap to arch_remove_memory and __remove_pages
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig 7b73d978a5 mm: pass the vmem_altmap to vmemmap_populate
We can just pass this on instead of having to do a radix tree lookup
without proper locking a few levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Christoph Hellwig 24e6d5a59a mm: pass the vmem_altmap to arch_add_memory and __add_pages
We can just pass this on instead of having to do a radix tree lookup
without proper locking 2 levels into the callchain.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-01-08 11:46:23 -08:00
Michael Ellerman e2d5915293 powerpc/pseries: Make RAS IRQ explicitly dependent on DLPAR WQ
The hotplug code uses its own workqueue to handle IRQ requests
(pseries_hp_wq), however that workqueue is initialized after
init_ras_IRQ(). That can lead to a kernel panic if any hotplug
interrupts fire after init_ras_IRQ() but before pseries_hp_wq is
initialised. eg:

  UDP-Lite hash table entries: 2048 (order: 0, 65536 bytes)
  NET: Registered protocol family 1
  Unpacking initramfs...
  (qemu) object_add memory-backend-ram,id=mem1,size=10G
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
  Unable to handle kernel paging request for data at address 0xf94d03007c421378
  Faulting instruction address: 0xc00000000012d744
  Oops: Kernel access of bad area, sig: 11 [#1]
  LE SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.15.0-rc2-ziviani+ #26
  task:         (ptrval) task.stack:         (ptrval)
  NIP:  c00000000012d744 LR: c00000000012d744 CTR: 0000000000000000
  REGS:         (ptrval) TRAP: 0380   Not tainted  (4.15.0-rc2-ziviani+)
  MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28088042  XER: 20040000
  CFAR: c00000000012d3c4 SOFTE: 0
  ...
  NIP [c00000000012d744] __queue_work+0xd4/0x5c0
  LR [c00000000012d744] __queue_work+0xd4/0x5c0
  Call Trace:
  [c0000000fffefb90] [c00000000012d744] __queue_work+0xd4/0x5c0 (unreliable)
  [c0000000fffefc70] [c00000000012dce4] queue_work_on+0xb4/0xf0

This commit makes the RAS IRQ registration explicitly dependent on the
creation of the pseries_hp_wq.

Reported-by: Min Deng <mdeng@redhat.com>
Reported-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Tested-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2018-01-08 14:54:32 +11:00
Linus Torvalds 3219e264b9 powerpc fixes for 4.15 #6
Just one fix to correctly return SEGV_ACCERR when we take a SEGV on a mapped
 region. The bug was introduced in the refactoring of the page fault handler we
 did in the previous release.
 
 Thanks to:
   John Sperbeck.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJaULvEExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBN
 hw//W6NeKl1iS7Xf5tcFX97I/TBakl0rS7gHGBMheQT4IGengQ3dJqfCMdq6nfyx
 Sss72UG1sVfeYNG5djJjoZ3hmnt1CjcMkRlQotUhFYACufHYiI8DwYlUTNYpQR6f
 z6uaItK2S5JWSzuk8wOe/VUmBqyIfe+LIWplh8uGy1vn93avbyHrJAAtPFeSyiBm
 TA6EMubY4n1NpNWyGIWBILO7e1yI0xT4jctwNy/ZAGC0lgutFb4sWY/ZxgYlQyKo
 fxb0Al8REpY73IjbZbSzcZ1GfdzDztda1fCNyUeKShRInSJTp31zasn4YCXzYOU8
 8yLw5DcnlA9Fyy7BV0IuFtAfV4wUHS9NDe8ebX6xKXarurTCwugoSbmHCQ8E7jIC
 4FFVhArQdraY+tumOwouJA7g4nGUtGV6rpZAUnd++7xVvFspJiAbFpbU2vUNnnJ4
 VoU2lvWjox9r6wxT001Ct/4M+XoR8+nnEKs8bll1771CyV+AQ4fGqoDga3dOB2cC
 M1ejwLFZ80ZnXDUY6wc4Wzor3G1knVRzuRLEcsoAe4vJunGsS1i9tYce9bOJ9la+
 okcmoPm0roPJSiT1bmiptIAsJRjZZq2cxr2+lBBQ9zlNuZyEIY/CwrBM/0ZP6RJI
 ljbjOCj0xvJBkmIBSenOVO/tIBi/Ww+wL2MDzsYv+K1pp24=
 =VAzk
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fix from Michael Ellerman:
 "Just one fix to correctly return SEGV_ACCERR when we take a SEGV on a
  mapped region. The bug was introduced in the refactoring of the page
  fault handler we did in the previous release.

  Thanks to John Sperbeck"

* tag 'powerpc-4.15-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm: Fix SEGV on mapped region to return SEGV_ACCERR
2018-01-06 09:48:27 -08:00
Stephen Boyd e0af0c1610 arch: Remove clkdev.h asm-generic from Kbuild
Now that every architecture is using the generic clkdev.h file
and we no longer include asm/clkdev.h anywhere in the tree, we
can remove it.

Cc: Russell King <linux@armlinux.org.uk>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: <linux-arch@vger.kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
2018-01-03 09:02:11 -08:00
John Sperbeck ecb101aed8 powerpc/mm: Fix SEGV on mapped region to return SEGV_ACCERR
The recent refactoring of the powerpc page fault handler in commit
c3350602e8 ("powerpc/mm: Make bad_area* helper functions") caused
access to protected memory regions to indicate SEGV_MAPERR instead of
the traditional SEGV_ACCERR in the si_code field of a user-space
signal handler. This can confuse debug libraries that temporarily
change the protection of memory regions, and expect to use SEGV_ACCERR
as an indication to restore access to a region.

This commit restores the previous behavior. The following program
exhibits the issue:

    $ ./repro read  || echo "FAILED"
    $ ./repro write || echo "FAILED"
    $ ./repro exec  || echo "FAILED"

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/mman.h>
    #include <assert.h>

    static void segv_handler(int n, siginfo_t *info, void *arg) {
            _exit(info->si_code == SEGV_ACCERR ? 0 : 1);
    }

    int main(int argc, char **argv)
    {
            void *p = NULL;
            struct sigaction act = {
                    .sa_sigaction = segv_handler,
                    .sa_flags = SA_SIGINFO,
            };

            assert(argc == 2);
            p = mmap(NULL, getpagesize(),
                    (strcmp(argv[1], "write") == 0) ? PROT_READ : 0,
                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            assert(p != MAP_FAILED);

            assert(sigaction(SIGSEGV, &act, NULL) == 0);
            if (strcmp(argv[1], "read") == 0)
                    printf("%c", *(unsigned char *)p);
            else if (strcmp(argv[1], "write") == 0)
                    *(unsigned char *)p = 0;
            else if (strcmp(argv[1], "exec") == 0)
                    ((void (*)(void))p)();
            return 1;  /* failed to generate SEGV */
    }

Fixes: c3350602e8 ("powerpc/mm: Make bad_area* helper functions")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Add commit references in change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-01-02 21:12:33 +11:00
David S. Miller 6bb8824732 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
net/ipv6/ip6_gre.c is a case of parallel adds.

include/trace/events/tcp.h is a little bit more tricky.  The removal
of in-trace-macro ifdefs in 'net' paralleled with moving
show_tcp_state_name and friends over to include/trace/events/sock.h
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-29 15:42:26 -05:00
Andrew Lunn 39c3fd5895 kernel/irq: Extend lockdep class for request mutex
The IRQ code already has support for lockdep class for the lock mutex
in an interrupt descriptor. Extend this to add a second class for the
request mutex in the descriptor. Not having a class is resulting in
false positive splats in some code paths.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: linus.walleij@linaro.org
Cc: grygorii.strashko@ti.com
Cc: f.fainelli@gmail.com
Link: https://lkml.kernel.org/r/1512234664-21555-1-git-send-email-andrew@lunn.ch
2017-12-28 12:26:35 +01:00
Linus Torvalds caf9a82657 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 PTI preparatory patches from Thomas Gleixner:
 "Todays Advent calendar window contains twentyfour easy to digest
  patches. The original plan was to have twenty three matching the date,
  but a late fixup made that moot.

   - Move the cpu_entry_area mapping out of the fixmap into a separate
     address space. That's necessary because the fixmap becomes too big
     with NRCPUS=8192 and this caused already subtle and hard to
     diagnose failures.

     The top most patch is fresh from today and cures a brain slip of
     that tall grumpy german greybeard, who ignored the intricacies of
     32bit wraparounds.

   - Limit the number of CPUs on 32bit to 64. That's insane big already,
     but at least it's small enough to prevent address space issues with
     the cpu_entry_area map, which have been observed and debugged with
     the fixmap code

   - A few TLB flush fixes in various places plus documentation which of
     the TLB functions should be used for what.

   - Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
     more than sysenter now and keeping the name makes backtraces
     confusing.

   - Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
     which is only invoked on fork().

   - Make vysycall more robust.

   - A few fixes and cleanups of the debug_pagetables code. Check
     PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
     C89 initialization of the address hint array which already was out
     of sync with the index enums.

   - Move the ESPFIX init to a different place to prepare for PTI.

   - Several code moves with no functional change to make PTI
     integration simpler and header files less convoluted.

   - Documentation fixes and clarifications"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
  init: Invoke init_espfix_bsp() from mm_init()
  x86/cpu_entry_area: Move it out of the fixmap
  x86/cpu_entry_area: Move it to a separate unit
  x86/mm: Create asm/invpcid.h
  x86/mm: Put MMU to hardware ASID translation in one place
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Move the CR3 construction functions to tlbflush.h
  x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
  x86/mm: Remove superfluous barriers
  x86/mm: Use __flush_tlb_one() for kernel memory
  x86/microcode: Dont abuse the TLB-flush interface
  x86/uv: Use the right TLB-flush API
  x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
  x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
  x86/mm/64: Improve the memory map documentation
  x86/ldt: Prevent LDT inheritance on exec
  x86/ldt: Rework locking
  arch, mm: Allow arch_dup_mmap() to fail
  x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
  ...
2017-12-23 11:53:04 -08:00
Linus Torvalds 9c294ec084 powerpc fixes for 4.15 #5
Of note is two fixes for KVM XIVE (Power9 interrupt controller). These would
 normally go via the KVM tree but Paul is away so I've picked them up.
 
 Other than that, two fixes for error handling in the IMC driver, and one for a
 potential oops in the BHRB code if the hardware records a branch address that
 has subsequently been unmapped, and finally a s/%p/%px/ in our oops code.
 
 Thanks to:
   Anju T Sudhakar, Cédric Le Goater, Laurent Vivier, Madhavan Srinivasan, Naveen
   N. Rao, Ravi Bangoria.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJaPNx6ExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBm
 Dw/+K2DRM23L4I1OD+i71N0F9DIxoS95FhIheqnidJxWfff+sFyRhL1IQa6AUTfv
 9vLGUQ6IcqmrzyiHClewRVsX0DeXB1mYpoCBIqhgyL1cspkp+cP7DubpaeB1wXpQ
 vlq2VL6ZfeRAGvMykLIoE/xXtfVx8CuaAjY9AUIFvRRP4vupcpbl503cHEXmhaP9
 GaV+8poslwbxYf9ZPucPJVg4dxmT2dEb/xiZ6lLTDt3QXZx3abnFWYXhxGkdGhpt
 yPszkE3cDlypsa2nPfotEby4ThE9D4Ypxk1unSQfcFkaVjKAwwQ9MDED8E1NpEH5
 hqxmYoUNqLcftcxSZHX93acyHgKfvfM69i/vN7YwjhMEISdSDYCTaDrkxv5ntK4S
 A3FncuApqYPMRtFi+8O4AinUS2t2KkdLYckP1bXC++++F9wRth3iifK4QTj6cV9u
 V4aAPWvNSTgye0lokcwQF2KVdfdku9pl/85bclKddwGa1byscvNvCVPKuexoR3fM
 /PSNgzOizTMiAkuEO4WYmmuNNziSUjIMEWTfO4jIi2jKhuxg+s6hPg7SYN+iyQ/T
 il4b/fjsX6snXtwzxH2Xjche3c0UIN8UfgEkgKO21gbdrr7Ec6IIzkdgwu2jMHnt
 fEzUPYtW0vH9OKRqgKkY+YHYsBXNXu+pFUAu2jaG3KfPSWE=
 =d5wh
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "This is all fairly boring, except that there's two KVM fixes that
  you'd normally get via Paul's kvm-ppc tree. He's away so I picked them
  up. I was waiting to see if he would apply them, which is why they
  have only been in my tree since today. But they were on the list for a
  while and have been tested on the relevant hardware.

  Of note is two fixes for KVM XIVE (Power9 interrupt controller). These
  would normally go via the KVM tree but Paul is away so I've picked
  them up.

  Other than that, two fixes for error handling in the IMC driver, and
  one for a potential oops in the BHRB code if the hardware records a
  branch address that has subsequently been unmapped, and finally a
  s/%p/%px/ in our oops code.

  Thanks to: Anju T Sudhakar, Cédric Le Goater, Laurent Vivier, Madhavan
  Srinivasan, Naveen N. Rao, Ravi Bangoria"

* tag 'powerpc-4.15-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()
  KVM: PPC: Book3S: fix XIVE migration of pending interrupts
  powerpc/kernel: Print actual address of regs when oopsing
  powerpc/perf: Fix kfree memory allocated for nest pmus
  powerpc/perf/imc: Fix nest-imc cpuhotplug callback failure
  powerpc/perf: Dereference BHRB entries safely
2017-12-22 12:38:30 -08:00
Thomas Gleixner c10e83f598 arch, mm: Allow arch_dup_mmap() to fail
In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-12-22 20:13:01 +01:00
David S. Miller fba961ab29 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes.  Also on the net-next side
the XDP state management is handled more in the generic
layers so undo the 'net' nfp fix which isn't applicable
in net-next.

Include a necessary change by Jakub Kicinski, with log message:

====================
cls_bpf no longer takes care of offload tracking.  Make sure
netdevsim performs necessary checks.  This fixes a warning
caused by TC trying to remove a filter it has not added.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-22 11:16:31 -05:00
Aneesh Kumar K.V 5769beaf18 powerpc/mm: Add proper pte access check helper for other platforms
pte_access_premitted get called in get_user_pages_fast path. If we
have marked the pte PROT_NONE, we should not allow a read access on
the address. With the current implementation we are not checking the
READ and only check for WRITE. This is needed on archs like ppc64 that
implement PROT_NONE using _PAGE_USER access instead of _PAGE_PRESENT.
Also add pte_user check just to make sure we are not accessing kernel
mapping.

Even though there is code duplication, keeping the low level pte
accessors different for different platforms helps in code readability.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 22:28:40 +11:00
Aneesh Kumar K.V f72a85e347 powerpc/mm/book3s/64: Add proper pte access check helper
pte_access_premitted get called in get_user_pages_fast path. If we
have marked the pte PROT_NONE, we should not allow a read access on
the address. With the current implementation we are not checking the
READ and only check for WRITE. This is needed on archs like ppc64 that
implement PROT_NONE using RWX access instead of _PAGE_PRESENT. Also
add pte_user check just to make sure we are not accessing kernel
mapping.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 22:28:35 +11:00
Aneesh Kumar K.V 5fa5b16be5 powerpc/mm/hugetlb: Use pte_access_permitted for hugetlb access check
No functional change in this patch. This update gup_hugepte to use the
helper. This will help later when we add memory keys.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 22:28:31 +11:00
Laurent Vivier 7333b5aca4 KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()
When we migrate a VM from a POWER8 host (XICS) to a POWER9 host
(XICS-on-XIVE), we have an error:

qemu-kvm: Unable to restore KVM interrupt controller state \
          (0xff000000) for CPU 0: Invalid argument

This is because kvmppc_xics_set_icp() checks the new state
is internaly consistent, and especially:

...
   1129         if (xisr == 0) {
   1130                 if (pending_pri != 0xff)
   1131                         return -EINVAL;
...

On the other side, kvmppc_xive_get_icp() doesn't set
neither the pending_pri value, nor the xisr value (set to 0)
(and kvmppc_xive_set_icp() ignores the pending_pri value)

As xisr is 0, pending_pri must be set to 0xff.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 15:36:24 +11:00
Cédric Le Goater dc1c4165d1 KVM: PPC: Book3S: fix XIVE migration of pending interrupts
When restoring a pending interrupt, we are setting the Q bit to force
a retrigger in xive_finish_unmask(). But we also need to force an EOI
in this case to reach the same initial state : P=1, Q=0.

This can be done by not setting 'old_p' for pending interrupts which
will inform xive_finish_unmask() that an EOI needs to be sent.

Fixes: 5af5099385 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-22 15:34:02 +11:00
Ram Pai 7e4363550c powerpc: capture the PTE format changes in the dump pte report
The H_PAGE_F_SECOND,H_PAGE_F_GIX are not in the 64K main-PTE.
capture these changes in the dump pte report.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:26 +11:00
Ram Pai a854868646 powerpc: use helper functions to get and set hash slots
replace redundant code in __hash_page_4K() and flush_hash_page()
with helper functions pte_get_hash_gslot() and pte_set_hidx()

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:25 +11:00
Ram Pai 273b493689 powerpc: Swizzle around 4K PTE bits to free up bit 5 and bit 6
We need PTE bits 3 ,4, 5, 6 and 57 to support protection-keys,
because these are the bits we want to consolidate on across all
configuration to support protection keys.

Bit 3,4,5 and 6 are currently used on 4K-pte kernels. But bit 9
and 10 are available. Hence we use the two available bits and
free up bit 5 and 6. We will still not be able to free up bit 3
and 4. In the absence of any other free bits, we will have to
stay satisfied with what we have :-(. This means we will not
be able to support 32 protection keys, but only 8. The bit
numbers are big-endian as defined in the ISA3.0

This patch does the following change to 4K PTE.

H_PAGE_F_SECOND (S) which occupied bit 4 moves to bit 7.
H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also moves
to bit 8,9, 10 respectively.
H_PAGE_HASHPTE (H) which occupied bit 8 moves to bit 4.

Before the patch, the 4k PTE format was as follows

 0 1 2 3 4  5  6  7  8 9 10....................57.....63
 : : : : :  :  :  :  : : :                      :     :
 v v v v v  v  v  v  v v v                      v     v
,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x|B|S |G |I |X |H| | |x|x|................| |x|x|x|
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'

After the patch, the 4k PTE format is as follows

 0 1 2 3 4  5  6  7  8 9 10....................57.....63
 : : : : :  :  :  :  : : :                      :     :
 v v v v v  v  v  v  v v v                      v     v
,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x|B|H |  |  |S |G|I|X|x|x|................| |.|.|.|
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'

The patch has no code changes; just swizzles around bits.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:24 +11:00
Ram Pai 7b84947cad powerpc: shifted-by-one hidx value
0xf is considered invalid hidx value. It indicates absence of a backing
HPTE. A PTE is initialized to 0xf either
a) when it is new it is newly allocated to hold 4k-backing-HPTE
	or
b) Any time it gets demoted to a 4k-backing-HPTE

This patch shifts the representation by one-modulo-0xf; i.e hidx 0 is
represented as 1, 1 as 2,... , and 0xf as 0. This convention lets us
initialize the secondary-part of the PTE to all zeroes. PTEs are anyway
zero'd when allocated. We do not have to zero them again; thus saving on
the initialization.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:23 +11:00
Ram Pai bf9a95f9a6 powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6
in the 64K backed HPTE pages. This along with the earlier
patch will entirely free up the four bits from 64K PTE.
The bit numbers are big-endian as defined in the ISA3.0

This patch does the following change to 64K PTE backed
by 64K HPTE.

H_PAGE_F_SECOND (S) which occupied bit 4 moves to the
	second part of the pte to bit 60.
H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also
	moves to the second part of the pte to bit 61,
 	62, 63, 64 respectively

since bit 7 is now freed up, we move H_PAGE_BUSY (B) from
bit 9 to bit 7.

The second part of the PTE will hold
(H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63.
NOTE: None of the bits in the secondary PTE were not used
by 64k-HPTE backed PTE.

Before the patch, the 64K HPTE backed 64k PTE format was
as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x| |S |G |I |X |x|B| |x|x|................|x|x|x|x| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
| | | | |  |  |  |  | | | | |..................| | | | | <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

After the patch, the 64k HPTE backed 64k PTE format is
as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x| |  |  |  |B |x| | |x|x|................|.|.|.|.| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
| | | | |  |  |  |  | | | | |..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

The above PTE changes is applicable to hugetlbpages aswell.

The patch does the following code changes:

a) moves the H_PAGE_F_SECOND and H_PAGE_F_GIX to 4k PTE
	header since it is no more needed b the 64k PTEs.
b) abstracts out __real_pte() and __rpte_to_hidx() so the
	caller need not know the bit location of the slot.
c) moves the slot bits to the secondary pte.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:22 +11:00
Ram Pai 9d2edb1848 powerpc: Free up four 64K PTE bits in 4K backed HPTE pages
Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6,
in the 4K backed HPTE pages.These bits continue to be used
for 64K backed HPTE pages in this patch, but will be freed
up in the next patch. The bit numbers are big-endian as
defined in the ISA3.0

The patch does the following change to the 4k HTPE backed
64K PTE's format.

H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
		below)
V0 which occupied bit 4 is not used anymore.
V1 which occupied bit 5 is not used anymore.
V2 which occupied bit 6 is not used anymore.
V3 which occupied bit 7 is not used anymore.

Before the patch, the 4k backed 64k PTE format was as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x|B|V0|V1|V2|V3|x| | |x|x|................|x|x|x|x| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

After the patch, the 4k backed 64k PTE format is as follows

 0 1 2 3 4  5  6  7  8 9 10...........................63
 : : : : :  :  :  :  : : :                            :
 v v v v v  v  v  v  v v v                            v

,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
|x|x|x| |  |  |  |  |x|B| |x|x|................|.|.|.|.| <- primary pte
'_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
|S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
'_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'

the four bits S,G,I,X (one quadruplet per 4k HPTE) that
cache the hash-bucket slot value, is initialized to
1,1,1,1 indicating -- an invalid slot. If a HPTE gets
cached in a 1111 slot(i.e 7th slot of secondary hash
bucket), it is released immediately. In other words,
even though 1111 is a valid slot value in the hash
bucket, we consider it invalid and release the slot and
the HPTE. This gives us the opportunity to determine
the validity of S,G,I,X bits based on its contents and
not on any of the bits V0,V1,V2 or V3 in the primary PTE

When we release a HPTE cached in the 1111 slot
we also release a legitimate slot in the primary
hash bucket and unmap its corresponding HPTE. This
is to ensure that we do get a HPTE cached in a slot
of the primary hash bucket, the next time we retry.

Though treating 1111 slot as invalid, reduces the
number of available slots in the hash bucket and may
have an effect on the performance, the probabilty of
hitting a 1111 slot is extermely low.

Compared to the current scheme, the above scheme
reduces the number of false hash table updates
significantly and has the added advantage of releasing
four valuable PTE bits for other purpose.

NOTE:even though bits 3, 4, 5, 6, 7 are not used when
the 64K PTE is backed by 4k HPTE, they continue to be
used if the PTE gets backed by 64k HPTE. The next
patch will decouple that aswell, and truely release the
bits.

This idea was jointly developed by Paul Mackerras,
Aneesh, Michael Ellermen and myself.

4K PTE format remains unchanged currently.

The patch does the following code changes
a) PTE flags are split between 64k and 4k header files.
b) __hash_page_4K() is reimplemented to reflect the
 above logic.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:20 +11:00
Ram Pai 318995b4f5 powerpc: introduce pte_get_hash_gslot() helper
Introduce pte_get_hash_gslot()() which returns the global slot number of
the HPTE in the global hash table.

This function will come in handy as we work towards re-arranging the PTE
bits in the later patches.

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:19 +11:00
Ram Pai 59aa31fd6f powerpc: introduce pte_set_hidx() helper
Introduce pte_set_hidx().It sets the (H_PAGE_F_SECOND|H_PAGE_F_GIX) bits
at the appropriate location in the PTE of 4K PTE. For 64K PTE, it sets
the bits in the second part of the PTE. Though the implementation for
the former just needs the slot parameter, it does take some additional
parameters to keep the prototype consistent.

This function will be handy as we work towards re-arranging the bits in
the subsequent patches.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-20 18:57:19 +11:00
Bjorn Helgaas 961292e664 powerpc: Set I/O port resource types correctly
Set I/O port resource structs to have IORESOURCE_IO in their type field.

Previously we marked these as merely IORESOURCE_BUSY without indicating the
type.  Setting the type doesn't fix any functional problem but makes %pR
on the resource work better.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2017-12-18 23:07:45 -06:00
Andy Shevchenko 9d987eba8e powerpc/pci: Unroll two pass loop when scanning bridges
The current scanning code is really hard to understand because it calls the
same function in a loop where pass value is changed without any comments
explaining it:

  for (pass = 0; pass < 2; pass++)
    for_each_pci_bridge(dev, bus)
      max = pci_scan_bridge(bus, dev, max, pass);

Unfamiliar reader cannot tell easily what is the purpose of this loop
without looking at internals of pci_scan_bridge().

In order to make this bit easier to understand, open-code the loop in
pci_scan_child_bus() and pci_hp_add_bridge() with added comments.

No functional changes intended.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2017-12-18 23:05:52 -06:00
Andy Shevchenko dd1ea5763e powerpc/pci: Use for_each_pci_bridge() helper
Use for_each_pci_bridge() helper to make the code slightly cleaner.  No
functional change intended.

Requires: 24a0c654d7 ("PCI: Add for_each_pci_bridge() helper")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-18 23:05:52 -06:00
Michael Ellerman 182dc9c7f2 powerpc/kernel: Print actual address of regs when oopsing
When we oops or otherwise call show_regs() we print the address of the
regs structure. Being able to see the address is fairly useful,
firstly to verify that the regs pointer is not completely bogus, and
secondly it allows you to dump the regs and surrounding memory with a
debugger if you have one.

In the normal case the regs will be located somewhere on the stack, so
printing their location discloses no further information than printing
the stack pointer does already.

So switch to %px and print the actual address, not the hashed value.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-19 13:09:40 +11:00
David S. Miller 59436c9ee1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2017-12-18

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Allow arbitrary function calls from one BPF function to another BPF function.
   As of today when writing BPF programs, __always_inline had to be used in
   the BPF C programs for all functions, unnecessarily causing LLVM to inflate
   code size. Handle this more naturally with support for BPF to BPF calls
   such that this __always_inline restriction can be overcome. As a result,
   it allows for better optimized code and finally enables to introduce core
   BPF libraries in the future that can be reused out of different projects.
   x86 and arm64 JIT support was added as well, from Alexei.

2) Add infrastructure for tagging functions as error injectable and allow for
   BPF to return arbitrary error values when BPF is attached via kprobes on
   those. This way of injecting errors generically eases testing and debugging
   without having to recompile or restart the kernel. Tags for opting-in for
   this facility are added with BPF_ALLOW_ERROR_INJECTION(), from Josef.

3) For BPF offload via nfp JIT, add support for bpf_xdp_adjust_head() helper
   call for XDP programs. First part of this work adds handling of BPF
   capabilities included in the firmware, and the later patches add support
   to the nfp verifier part and JIT as well as some small optimizations,
   from Jakub.

4) The bpftool now also gets support for basic cgroup BPF operations such
   as attaching, detaching and listing current BPF programs. As a requirement
   for the attach part, bpftool can now also load object files through
   'bpftool prog load'. This reuses libbpf which we have in the kernel tree
   as well. bpftool-cgroup man page is added along with it, from Roman.

5) Back then commit e87c6bc385 ("bpf: permit multiple bpf attachments for
   a single perf event") added support for attaching multiple BPF programs
   to a single perf event. Given they are configured through perf's ioctl()
   interface, the interface has been extended with a PERF_EVENT_IOC_QUERY_BPF
   command in this work in order to return an array of one or multiple BPF
   prog ids that are currently attached, from Yonghong.

6) Various minor fixes and cleanups to the bpftool's Makefile as well
   as a new 'uninstall' and 'doc-uninstall' target for removing bpftool
   itself or prior installed documentation related to it, from Quentin.

7) Add CONFIG_CGROUP_BPF=y to the BPF kernel selftest config file which is
   required for the test_dev_cgroup test case to run, from Naresh.

8) Fix reporting of XDP prog_flags for nfp driver, from Jakub.

9) Fix libbpf's exit code from the Makefile when libelf was not found in
   the system, also from Jakub.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-18 10:51:06 -05:00
Alexei Starovoitov 60b58afc96 bpf: fix net.core.bpf_jit_enable race
global bpf_jit_enable variable is tested multiple times in JITs,
blinding and verifier core. The malicious root can try to toggle
it while loading the programs. This race condition was accounted
for and there should be no issues, but it's safer to avoid
this race condition.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-17 20:34:36 +01:00
Daniel Borkmann 87338c8e2c bpf, ppc64: do not reload skb pointers in non-skb context
The assumption of unconditionally reloading skb pointers on
BPF helper calls where bpf_helper_changes_pkt_data() holds
true is wrong. There can be different contexts where the helper
would enforce a reload such as in case of XDP. Here, we do
have a struct xdp_buff instead of struct sk_buff as context,
thus this will access garbage.

JITs only ever need to deal with cached skb pointer reload
when ld_abs/ind was seen, therefore guard the reload behind
SEEN_SKB.

Fixes: 156d0e290e ("powerpc/ebpf/jit: Implement JIT compiler for extended BPF")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2017-12-15 09:19:35 -08:00
Paolo Bonzini 5cb0944c0c KVM: introduce kvm_arch_vcpu_async_ioctl
After the vcpu_load/vcpu_put pushdown, the handling of asynchronous VCPU
ioctl is already much clearer in that it is obvious that they bypass
vcpu_load and vcpu_put.

However, it is still not perfect in that the different state of the VCPU
mutex is still hidden in the caller.  Separate those ioctls into a new
function kvm_arch_vcpu_async_ioctl that returns -ENOIOCTLCMD for more
"traditional" synchronous ioctls.

Cc: James Hogan <jhogan@kernel.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:59 +01:00
Christoffer Dall 9b062471e5 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl
Move the calls to vcpu_load() and vcpu_put() in to the architecture
specific implementations of kvm_arch_vcpu_ioctl() which dispatches
further architecture-specific ioctls on to other functions.

Some architectures support asynchronous vcpu ioctls which cannot call
vcpu_load() or take the vcpu->mutex, because that would prevent
concurrent execution with a running VCPU, which is the intended purpose
of these ioctls, for example because they inject interrupts.

We repeat the separate checks for these specifics in the architecture
code for MIPS, S390 and PPC, and avoid taking the vcpu->mutex and
calling vcpu_load for these ioctls.

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:58 +01:00
Christoffer Dall 66b5656222 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_guest_debug
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_guest_debug().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:56 +01:00
Christoffer Dall 1da5b61dac KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_translate
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_translate().

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:55 +01:00
Christoffer Dall b4ef9d4e8c KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_sregs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_sregs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:53 +01:00
Christoffer Dall bcdec41cef KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_sregs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_sregs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:52 +01:00
Christoffer Dall 875656fe0c KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_regs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_set_regs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:52 +01:00
Christoffer Dall 1fc9b76b3d KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_get_regs
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_get_regs().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:51 +01:00
Christoffer Dall accb757d79 KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_run
Move vcpu_load() and vcpu_put() into the architecture specific
implementations of kvm_arch_vcpu_ioctl_run().

Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390 parts
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
[Rebased. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-14 09:26:50 +01:00
Anju T Sudhakar 110df8bd3e powerpc/perf: Fix kfree memory allocated for nest pmus
imc_common_cpuhp_mem_free() is the common function for all
IMC (In-memory Collection counters) domains to unregister cpuhotplug
callback and free memory. Since kfree of memory allocated for
nest-imc (per_nest_pmu_arr) is in the common code, all
domains (core/nest/thread) can do the kfree in the failure case.

This could potentially create a call trace as shown below, where
core(/thread/nest) imc pmu initialization fails and in the failure
path imc_common_cpuhp_mem_free() free the memory(per_nest_pmu_arr),
which is allocated by successfully registered nest units.

The call trace is generated in a scenario where core-imc
initialization is made to fail and a cpuhotplug is performed in a p9
system. During cpuhotplug ppc_nest_imc_cpu_offline() tries to access
per_nest_pmu_arr, which is already freed by core-imc.

  NIP [c000000000cb6a94] mutex_lock+0x34/0x90
  LR [c000000000cb6a88] mutex_lock+0x28/0x90
  Call Trace:
    mutex_lock+0x28/0x90 (unreliable)
    perf_pmu_migrate_context+0x90/0x3a0
    ppc_nest_imc_cpu_offline+0x190/0x1f0
    cpuhp_invoke_callback+0x160/0x820
    cpuhp_thread_fun+0x1bc/0x270
    smpboot_thread_fn+0x250/0x290
    kthread+0x1a8/0x1b0
    ret_from_kernel_thread+0x5c/0x74

To address this scenario do the kfree(per_nest_pmu_arr) only in case
of nest-imc initialization failure, and when there is no other nest
units registered.

Fixes: 73ce9aec65 ("powerpc/perf: Fix IMC_MAX_PMU macro")
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-13 20:51:22 +11:00
Anju T Sudhakar ad2b6e0102 powerpc/perf/imc: Fix nest-imc cpuhotplug callback failure
Oops is observed during boot:

  Faulting instruction address: 0xc000000000248340
  cpu 0x0: Vector: 380 (Data Access Out of Range) at [c000000ff66fb850]
      pc: c000000000248340: event_function_call+0x50/0x1f0
      lr: c00000000024878c: perf_remove_from_context+0x3c/0x100
      sp: c000000ff66fbad0
     msr: 9000000000009033
     dar: 7d20e2a6f92d03c0
    pid = 14, comm = cpuhp/0

While registering the cpuhotplug callbacks for nest-imc, if we fail in
the cpuhotplug online path for any random node in a multi node
system (because the opal call to stop nest-imc counters fails for that
node), ppc_nest_imc_cpu_offline() will get invoked for other nodes who
successfully returned from cpuhotplug online path.

This call trace is generated since in the ppc_nest_imc_cpu_offline()
path we are trying to migrate the event context, when nest-imc
counters are not even initialized.

Patch to add a check to ensure that nest-imc is registered before
migrating the event context.

Fixes: 885dcd709b ("powerpc/perf: Add nest IMC PMU support")
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-13 20:36:53 +11:00
Ravi Bangoria f41d84dddc powerpc/perf: Dereference BHRB entries safely
It's theoretically possible that branch instructions recorded in
BHRB (Branch History Rolling Buffer) entries have already been
unmapped before they are processed by the kernel. Hence, trying to
dereference such memory location will result in a crash. eg:

    Unable to handle kernel paging request for data at address 0xd000000019c41764
    Faulting instruction address: 0xc000000000084a14
    NIP [c000000000084a14] branch_target+0x4/0x70
    LR [c0000000000eb828] record_and_restart+0x568/0x5c0
    Call Trace:
    [c0000000000eb3b4] record_and_restart+0xf4/0x5c0 (unreliable)
    [c0000000000ec378] perf_event_interrupt+0x298/0x460
    [c000000000027964] performance_monitor_exception+0x54/0x70
    [c000000000009ba4] performance_monitor_common+0x114/0x120

Fix it by deferefencing the addresses safely.

Fixes: 691231846c ("powerpc/perf: Fix setting of "to" addresses for BHRB")
Cc: stable@vger.kernel.org # v3.10+
Suggested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
[mpe: Use probe_kernel_read() which is clearer, tweak change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-13 20:29:20 +11:00
Bryant G. Ly 608c0d8804 PCI/IOV: Add pci_vf_drivers_autoprobe() interface
Add a pci_vf_drivers_autoprobe() interface. Setting autoprobe to false
on the PF prevents drivers from binding to VFs when they are enabled.

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Juan J. Alvarez <jjalvare@linux.vnet.ibm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:36 +11:00
Bryant G. Ly dae7253f9f powerpc/pseries: Add pseries SR-IOV Machine dependent calls
Add calls for pseries platform to configure/deconfigure SR-IOV.

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Juan J. Alvarez <jjalvare@us.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:36 +11:00
Bryant G. Ly 988fc3ba56 powerpc/pci: Separate SR-IOV Calls
SR-IOV can now be enabled for the powernv platform and pseries
platform. Therefore move the appropriate calls to machine dependent
code instead of relying on definition at compile time.

Signed-off-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Juan J. Alvarez <jjalvare@us.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:35 +11:00
Alan Modra 5c45b52801 powerpc/modules: Fix alignment of .toc section in kernel modules
powerpc64 gcc can generate code that offsets an address, to access
part of an object in memory. If the address is a -mcmodel=medium toc
pointer relative address then code like the following is possible.

  addis r9,r2,var@toc@ha
  ld r3,var@toc@l(r9)
  ld r4,(var+8)@toc@l(r9)

This works fine so long as var is naturally aligned, *and* r2 is
sufficiently aligned. If not, there is a possibility that the offset
added to access var+8 wraps over a n*64k+32k boundary. Modules don't
have any guarantee that r2 is sufficiently aligned. Moreover, code
generated by older compilers generates a .toc section with 2**0
alignment, which can result in relocation failures at module load time
even without the wrap problem.

Thus, this patch links modules with an aligned .toc section (Makefile
and module.lds changes), and forces alignment for out of tree modules
or those without a .toc section (module_64.c changes).

Signed-off-by: Alan Modra <amodra@gmail.com>
[desnesn: updated patch to apply to powerpc-next kernel v4.15 ]
Signed-off-by: Desnes A. Nunes do Rosario <desnesn@linux.vnet.ibm.com>
[mpe: Fix out-of-tree build, swap -256 for ~0xff, reflow comment]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:35 +11:00
Nicholas Piggin acb1feab32 powerpc/64: Don't trace irqs-off at interrupt return to soft-disabled context
When an interrupt is returning to a soft-disabled context (which can
happen for non-maskable interrupts or synchronous interrupts), it goes
through the motions of soft-disabling again, including calling
TRACE_DISABLE_INTS (i.e., trace_hardirqs_off()).

This is not necessary, because we must already be soft-disabled in the
interrupt context, it also may be causing crashes in the irq tracing
code to re-enter as an nmi. Replace it with a warning to ensure that
soft-interrupts are still disabled.

Fixes: 7c0482e3d0 ("powerpc/irq: Fix another case of lazy IRQ state getting out of sync")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:34 +11:00
Ivan Mikhaylov 7813043e1b powerpc/44x/fsp2: Add irq error handlers
Add irq error handlers for cmu, plb, opb, mcue, conf
with debug information output in case of problems.

Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:34 +11:00
Ivan Mikhaylov 50f01c57d7 powerpc/44x/fsp2: tvsense workaround for dd1
TVSENSE(temperature and voltage sensors) reset is blocked (clock gated)
by the POR default of the TVS sleep config bit. As a consequence,
TVSENSE will provide erratic sensor values, which may result in
spurious (parity) errors recorded in the CMU FIR and leading to
erroneous interrupt requests once the CMU interrupt is unmasked.
Purpose of this to set up CMU in working state in any cases even
in case of parity errors.

Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:33 +11:00
Ivan Mikhaylov 9c4c374676 powerpc/44x/fsp2: Interrupt handling setup
* clear out any possible plb6 errors
* board interrupt handling setup within l2 reg set
* fsp2 parity error setup

All those points are needed for correct interrupt
handling on board level including error handling report.

Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:32 +11:00
Ivan Mikhaylov 494d82ceae powerpc/44x/fsp2: Add fsp2 headers
Add cmu, plbX, l2, ddr3/4, crcs register definitions.
Add mfcmu, mtcmu functions for indirect access to cmu.
Add mtl2, mfl2 same for l2 cache core reg set.

Signed-off-by: Ivan Mikhaylov <ivan@de.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:32 +11:00
Nicholas Piggin a10726075d powerpc/32: Add .data.rel* sections explicitly
Match powerpc/64 and include .data.rel* input sections in the .data output
section explicitly.

This solves the warning:

powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.

Link: https://lists.01.org/pipermail/kbuild-all/2017-November/040010.html
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:31 +11:00
Michael Ellerman c3bb690dc6 powerpc/boot: Only build CPM code when necessary
As far as I can tell CONFIG_CPM is the right symbol to use to
conditionally compile the cpm-serial.c code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:31 +11:00
Michael Ellerman f8e8e69cea powerpc/boot: Only build OPAL code when necessary
Only build the OPAL console code in when necessary. This looks like it
should use CONFIG_PPC_POWERNV, but because the opal-call.S code is
64-bit only, we must only build it when we're building the boot
wrapper 64-bit.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:30 +11:00
Michael Ellerman 3d6bf693d8 powerpc/boot: Only build uartlite if XILINX_VIRTEX=y
The serial code in uartlite.c only matches if we find one of two
Xilinx (xlnx) nodes in the device tree, there's no need to build or
link the code on other platforms.

As far as I can tell CONFIG_XILINX_VIRTEX is the appropriate symbol to
use to conditionally compile the code.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:30 +11:00
Josh Poimboeuf 1ea61ea239 powerpc/modules: Improve restore_r2() error message
Print the function address associated with the restore_r2() error to
make it easier to debug the problem.

Also clarify the wording a bit.

Before:

  module_64: patch_foo: Expect noop after relocate, got 3c820000

After:

  module_64: patch_foo: Expected nop after call, got 7c630034 at netdev_has_upper_dev+0x54/0xb0 [patch_foo]

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
[mpe: Change noop to nop, as that's the name of the instruction]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:29 +11:00
Josh Poimboeuf b9eab08d01 powerpc/modules: Don't try to restore r2 after a sibling call
When attempting to load a livepatch module, I got the following error:

  module_64: patch_module: Expect noop after relocate, got 3c820000

The error was triggered by the following code in
unregister_netdevice_queue():

  14c:   00 00 00 48     b       14c <unregister_netdevice_queue+0x14c>
                         14c: R_PPC64_REL24      net_set_todo
  150:   00 00 82 3c     addis   r4,r2,0

GCC didn't insert a nop after the branch to net_set_todo() because it's
a sibling call, so it never returns.  The nop isn't needed after the
branch in that case.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Reviewed-and-tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:29 +11:00
Kamalesh Babulal a443bf6e8a powerpc/modules: Add REL24 relocation support of livepatch symbols
Livepatch re-uses module loader function apply_relocate_add() to write
relocations, instead of managing them by arch-dependent
klp_write_module_reloc() function.

apply_relocate_add() doesn't understand livepatch symbols (marked with
SHN_LIVEPATCH symbol section index) and assumes them to be local
symbols by default for R_PPC64_REL24 relocation type. It fails with an
error, when trying to calculate offset with local_entry_offset():

  module_64: kpatch_meminfo: REL24 -1152921504897399800 out of range!

Whereas livepatch symbols are essentially SHN_UNDEF, should be called
via stub used for global calls. This issue can be fixed by teaching
apply_relocate_add() to handle both SHN_UNDEF/SHN_LIVEPATCH symbols
via the same stub. This patch extends SHN_UNDEF code to handle
livepatch symbols too.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:28 +11:00
Benjamin Herrenschmidt 5138b31422 powerpc: Reduce log level of "OPAL detected !" message
This message isn't terribly useful.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:28 +11:00
Benjamin Herrenschmidt 10de741fd3 powerpc: Remove DEBUG define in 64-bit early setup code
This statement causes some not very useful messages to always
be printed on the serial port at boot, even on quiet boots.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-11 13:03:27 +11:00
David S. Miller 51e18a453f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict was two parallel additions of include files to sch_generic.c,
no biggie.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-09 22:09:55 -05:00
Linus Torvalds e9ef1fe312 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb
    drivers).

 2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
    to userspace and broke some apps. From Johannes Berg.

 3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong
    Wang.

 4) Gianfar MAC can't do EEE so don't advertise it by default, from
    Claudiu Manoil.

 5) Relax strict netlink attribute validation, but emit a warning. From
    David Ahern.

 6) Fix regression in checksum offload of thunderx driver, from Florian
    Westphal.

 7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.

 8) New card support in iwlwifi, from Ihab Zhaika.

 9) BBR congestion control bug fixes from Neal Cardwell.

10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.

11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.

12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.

13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
  net: mvpp2: fix the RSS table entry offset
  tcp: evaluate packet losses upon RTT change
  tcp: fix off-by-one bug in RACK
  tcp: always evaluate losses in RACK upon undo
  tcp: correctly test congestion state in RACK
  bnxt_en: Fix sources of spurious netpoll warnings
  tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
  tcp_bbr: reset full pipe detection on loss recovery undo
  tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
  sfc: pass valid pointers from efx_enqueue_unwind
  gianfar: Disable EEE autoneg by default
  tcp: invalidate rate samples during SACK reneging
  can: peak/pcie_fd: fix potential bug in restarting tx queue
  can: usb_8dev: cancel urb on -EPIPE and -EPROTO
  can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
  can: esd_usb2: cancel urb on -EPIPE and -EPROTO
  can: ems_usb: cancel urb on -EPIPE and -EPROTO
  can: mcba_usb: cancel urb on -EPROTO
  usbnet: fix alignment for frames with no ethernet header
  tcp: use current time in tcp_rcv_space_adjust()
  ...
2017-12-08 13:32:44 -08:00
Michael Ellerman d810418208 powerpc/xmon: Don't print hashed pointers in xmon
Since commit ad67b74d24 ("printk: hash addresses printed with %p")
pointers printed with %p are hashed, ie. you don't see the actual
pointer value but rather a cryptographic hash of its value.

In xmon we want to see the actual pointer values, because xmon is a
debugger, so replace %p with %px which prints the actual pointer
value.

We justify doing this in xmon because 1) xmon is a kernel crash
debugger, it's only accessible via the console 2) xmon doesn't print
to dmesg, so the pointers it prints are not able to be leaked that
way.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-07 00:27:01 +11:00
Nicholas Piggin 371b80447f powerpc/64s: Initialize ISAv3 MMU registers before setting partition table
kexec can leave MMU registers set when booting into a new kernel,
the PIDR (Process Identification Register) in particular. The boot
sequence does not zero PIDR, so it only gets set when CPUs first
switch to a userspace processes (until then it's running a kernel
thread with effective PID = 0).

This leaves a window where a process table entry and page tables are
set up due to user processes running on other CPUs, that happen to
match with a stale PID. The CPU with that PID may cause speculative
accesses that address quadrant 0 (aka userspace addresses), which will
result in cached translations and PWC (Page Walk Cache) for that
process, on a CPU which is not in the mm_cpumask and so they will not
be invalidated properly.

The most common result is the kernel hanging in infinite page fault
loops soon after kexec (usually in schedule_tail, which is usually the
first non-speculative quadrant 0 access to a new PID) due to a stale
PWC. However being a stale translation error, it could result in
anything up to security and data corruption problems.

Fix this by zeroing out PIDR at boot and kexec.

Fixes: 7e381c0ff6 ("powerpc/mm/radix: Add mmu context handling callback for radix")
Cc: stable@vger.kernel.org # v4.7+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-06 23:32:43 +11:00
Serhii Popovych 4ed11aeefd KVM: PPC: Book3S HV: Fix use after free in case of multiple resize requests
When serving multiple resize requests following could happen:

    CPU0                                    CPU1
    ----                                    ----
    kvm_vm_ioctl_resize_hpt_prepare(1);
      -> schedule_work()
                                            /* system_rq might be busy: delay */
    kvm_vm_ioctl_resize_hpt_prepare(2);
      mutex_lock();
      if (resize) {
         ...
         release_hpt_resize();
      }
      ...                                   resize_hpt_prepare_work()
      -> schedule_work()                    {
      mutex_unlock()                           /* resize->kvm could be wrong */
                                               struct kvm *kvm = resize->kvm;

                                               mutex_lock(&kvm->lock);   <<<< UAF
                                               ...
                                            }

i.e. a second resize request with different order could be started by
kvm_vm_ioctl_resize_hpt_prepare(), causing the previous request to be
free()d when there's still an active worker thread which will try to
access it.  This leads to a use after free in point marked with UAF on
the diagram above.

To prevent this from happening, instead of unconditionally releasing a
pre-existing resize structure from the prepare ioctl(), we check if
the existing structure has an in-progress worker.  We do that by
checking if the resize->error == -EBUSY, which is safe because the
resize->error field is protected by the kvm->lock.  If there is an
active worker, instead of releasing, we mark the structure as stale by
unlinking it from kvm_struct.

In the worker thread we check for a stale structure (with kvm->lock
held), and in that case abort, releasing the stale structure ourself.
We make the check both before and the actual allocation.  Strictly,
only the check afterwards is needed, the check before is an
optimization: if the structure happens to become stale before the
worker thread is dispatched, rather than during the allocation, it
means we can avoid allocating then immediately freeing a potentially
substantial amount of memory.

This fixes following or similar host kernel crash message:

[  635.277361] Unable to handle kernel paging request for data at address 0x00000000
[  635.277438] Faulting instruction address: 0xc00000000052f568
[  635.277446] Oops: Kernel access of bad area, sig: 11 [#1]
[  635.277451] SMP NR_CPUS=2048 NUMA PowerNV
[  635.277470] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter nfsv3 nfs_acl nfs
lockd grace fscache kvm_hv kvm rpcrdma sunrpc ib_isert iscsi_target_mod ib_iser libiscsi
scsi_transport_iscsi ib_srpt target_core_mod ext4 ib_srp scsi_transport_srp
ib_ipoib mbcache jbd2 rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ocrdma(T)
ib_core ses enclosure scsi_transport_sas sg shpchp leds_powernv ibmpowernv i2c_opal
i2c_core powernv_rng ipmi_powernv ipmi_devintf ipmi_msghandler ip_tables xfs
libcrc32c sr_mod sd_mod cdrom lpfc nvme_fc(T) nvme_fabrics nvme_core ipr nvmet_fc(T)
tg3 nvmet libata be2net crc_t10dif crct10dif_generic scsi_transport_fc ptp scsi_tgt
pps_core crct10dif_common dm_mirror dm_region_hash dm_log dm_mod
[  635.278687] CPU: 40 PID: 749 Comm: kworker/40:1 Tainted: G
------------ T 3.10.0.bz1510771+ #1
[  635.278782] Workqueue: events resize_hpt_prepare_work [kvm_hv]
[  635.278851] task: c0000007e6840000 ti: c0000007e9180000 task.ti: c0000007e9180000
[  635.278919] NIP: c00000000052f568 LR: c0000000009ea310 CTR: c0000000009ea4f0
[  635.278988] REGS: c0000007e91837f0 TRAP: 0300   Tainted: G
------------ T  (3.10.0.bz1510771+)
[  635.279077] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002022  XER:
00000000
[  635.279248] CFAR: c000000000009368 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
GPR00: c0000000009ea310 c0000007e9183a70 c000000001250b00 c0000007e9183b10
GPR04: 0000000000000000 0000000000000000 c0000007e9183650 0000000000000000
GPR08: c0000007ffff7b80 00000000ffffffff 0000000080000028 d00000000d2529a0
GPR12: 0000000000002200 c000000007b56800 c000000000120028 c0000007f135bb40
GPR16: 0000000000000000 c000000005c1e018 c000000005c1e018 0000000000000000
GPR20: 0000000000000001 c0000000011bf778 0000000000000001 fffffffffffffef7
GPR24: 0000000000000000 c000000f1e262e50 0000000000000002 c0000007e9180000
GPR28: c000000f1e262e4c c000000f1e262e50 0000000000000000 c0000007e9183b10
[  635.280149] NIP [c00000000052f568] __list_add+0x38/0x110
[  635.280197] LR [c0000000009ea310] __mutex_lock_slowpath+0xe0/0x2c0
[  635.280253] Call Trace:
[  635.280277] [c0000007e9183af0] [c0000000009ea310] __mutex_lock_slowpath+0xe0/0x2c0
[  635.280356] [c0000007e9183b70] [c0000000009ea554] mutex_lock+0x64/0x70
[  635.280426] [c0000007e9183ba0] [d00000000d24da04]
resize_hpt_prepare_work+0xe4/0x1c0 [kvm_hv]
[  635.280507] [c0000007e9183c40] [c000000000113c0c] process_one_work+0x1dc/0x680
[  635.280587] [c0000007e9183ce0] [c000000000114250] worker_thread+0x1a0/0x520
[  635.280655] [c0000007e9183d80] [c00000000012010c] kthread+0xec/0x100
[  635.280724] [c0000007e9183e30] [c00000000000a4b8] ret_from_kernel_thread+0x5c/0xa4
[  635.280814] Instruction dump:
[  635.280880] 7c0802a6 fba1ffe8 fbc1fff0 7cbd2b78 fbe1fff8 7c9e2378 7c7f1b78
f8010010
[  635.281099] f821ff81 e8a50008 7fa52040 40de00b8 <e8be0000> 7fbd2840 40de008c
7fbff040
[  635.281324] ---[ end trace b628b73449719b9d ]---

Cc: stable@vger.kernel.org # v4.10+
Fixes: b5baa68773 ("KVM: PPC: Book3S HV: KVM-HV HPT resizing implementation")
Signed-off-by: Serhii Popovych <spopovyc@redhat.com>
[dwg: Replaced BUG_ON()s with WARN_ONs() and reworded commit message
 for clarity]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-12-06 13:36:22 +11:00
Serhii Popovych 3073774e63 KVM: PPC: Book3S HV: Drop prepare_done from struct kvm_resize_hpt
Currently the kvm_resize_hpt structure has two fields relevant to the
state of an ongoing resize: 'prepare_done', which indicates whether
the worker thread has completed or not, and 'error' which indicates
whether it was successful or not.

Since the success/failure isn't known until completion, this is
confusingly redundant.  This patch consolidates the information into
just the 'error' value: -EBUSY indicates the worked is still in
progress, other negative values indicate (completed) failure, 0
indicates successful completion.

As a bonus this reduces size of struct kvm_resize_hpt by
__alignof__(struct kvm_hpt_info) and saves few bytes of code.

While there correct comment in struct kvm_resize_hpt which references
a non-existent semaphore (leftover from an early draft).

Assert with WARN_ON() in case of HPT allocation thread work runs more
than once for resize request or resize_hpt_allocate() returns -EBUSY
that is treated specially.

Change comparison against zero to make checkpatch.pl happy.

Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Serhii Popovych <spopovyc@redhat.com>
[dwg: Changed BUG_ON()s to WARN_ON()s and altered commit message for
 clarity]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-12-06 13:35:21 +11:00
David S. Miller 7cda4cee13 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Small overlapping change conflict ('net' changed a line,
'net-next' added a line right afterwards) in flexcan.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-05 10:44:19 -05:00
Hendrik Brueckner c895f6f703 bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type
Commit 0515e5999a ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT
program type") introduced the bpf_perf_event_data structure which
exports the pt_regs structure.  This is OK for multiple architectures
but fail for s390 and arm64 which do not export pt_regs.  Programs
using them, for example, the bpf selftest fail to compile on these
architectures.

For s390, exporting the pt_regs is not an option because s390 wants
to allow changes to it.  For arm64, there is a user_pt_regs structure
that covers parts of the pt_regs structure for use by user space.

To solve the broken uapi for s390 and arm64, introduce an abstract
type for pt_regs and add an asm/bpf_perf_event.h file that concretes
the type.  An asm-generic header file covers the architectures that
export pt_regs today.

The arch-specific enablement for s390 and arm64 follows in separate
commits.

Reported-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Fixes: 0515e5999a ("bpf: introduce BPF_PROG_TYPE_PERF_EVENT program type")
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Reviewed-and-tested-by: Thomas Richter <tmricht@linux.vnet.ibm.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2017-12-05 15:02:40 +01:00
David Gibson ab9dbf771f Revert "powerpc: Do not call ppc_md.panic in fadump panic notifier"
This reverts commit a3b2cb30f2.

That commit tried to fix problems with panic on powerpc in certain
circumstances, where some output from the generic panic code was being
dropped.

Unfortunately, it breaks things worse in other circumstances. In
particular when running a PAPR guest, it will now attempt to reboot
instead of informing the hypervisor (KVM or PowerVM) that the guest
has crashed. The crash notification is important to some
virtualization management layers.

Revert it for now until we can come up with a better solution.

Fixes: a3b2cb30f2 ("powerpc: Do not call ppc_md.panic in fadump panic notifier")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Tweak change log a bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-05 23:21:46 +11:00
Miroslav Benes 43347d56c8 livepatch: send a fake signal to all blocking tasks
Live patching consistency model is of LEAVE_PATCHED_SET and
SWITCH_THREAD. This means that all tasks in the system have to be marked
one by one as safe to call a new patched function. Safe means when a
task is not (sleeping) in a set of patched functions. That is, no
patched function is on the task's stack. Another clearly safe place is
the boundary between kernel and userspace. The patching waits for all
tasks to get outside of the patched set or to cross the boundary. The
transition is completed afterwards.

The problem is that a task can block the transition for quite a long
time, if not forever. It could sleep in a set of patched functions, for
example.  Luckily we can force the task to leave the set by sending it a
fake signal, that is a signal with no data in signal pending structures
(no handler, no sign of proper signal delivered). Suspend/freezer use
this to freeze the tasks as well. The task gets TIF_SIGPENDING set and
is woken up (if it has been sleeping in the kernel before) or kicked by
rescheduling IPI (if it was running on other CPU). This causes the task
to go to kernel/userspace boundary where the signal would be handled and
the task would be marked as safe in terms of live patching.

There are tasks which are not affected by this technique though. The
fake signal is not sent to kthreads. They should be handled differently.
They can be woken up so they leave the patched set and their
TIF_PATCH_PENDING can be cleared thanks to stack checking.

For the sake of completeness, if the task is in TASK_RUNNING state but
not currently running on some CPU it doesn't get the IPI, but it would
eventually handle the signal anyway. Second, if the task runs in the
kernel (in TASK_RUNNING state) it gets the IPI, but the signal is not
handled on return from the interrupt. It would be handled on return to
the userspace in the future when the fake signal is sent again. Stack
checking deals with these cases in a better way.

If the task was sleeping in a syscall it would be woken by our fake
signal, it would check if TIF_SIGPENDING is set (by calling
signal_pending() predicate) and return ERESTART* or EINTR. Syscalls with
ERESTART* return values are restarted in case of the fake signal (see
do_signal()). EINTR is propagated back to the userspace program. This
could disturb the program, but...

* each process dealing with signals should react accordingly to EINTR
  return values.
* syscalls returning EINTR happen to be quite common situation in the
  system even if no fake signal is sent.
* freezer sends the fake signal and does not deal with EINTR anyhow.
  Thus EINTR values are returned when the system is resumed.

The very safe marking is done in architectures' "entry" on syscall and
interrupt/exception exit paths, and in a stack checking functions of
livepatch.  TIF_PATCH_PENDING is cleared and the next
recalc_sigpending() drops TIF_SIGPENDING. In connection with this, also
call klp_update_patch_state() before do_signal(), so that
recalc_sigpending() in dequeue_signal() can clear TIF_PATCH_PENDING
immediately and thus prevent a double call of do_signal().

Note that the fake signal is not sent to stopped/traced tasks. Such task
prevents the patching to finish till it continues again (is not traced
anymore).

Last, sending the fake signal is not automatic. It is done only when
admin requests it by writing 1 to signal sysfs attribute in livepatch
sysfs directory.

Signed-off-by: Miroslav Benes <mbenes@suse.cz>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: x86@kernel.org
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-12-04 22:34:57 +01:00
Al Viro 5bb866de17 ppc: for put_user() pull linux/uaccess.h, not asm/uaccess.h
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-04 14:43:20 -05:00
Ravi Bangoria 5aa04b3eb6 powerpc/perf: Fix oops when grouping different pmu events
When user tries to group imc (In-Memory Collections) event with
normal event, (sometime) kernel crashes with following log:

    Faulting instruction address: 0x00000000
    [link register   ] c00000000010ce88 power_check_constraints+0x128/0x980
    ...
    c00000000010e238 power_pmu_event_init+0x268/0x6f0
    c0000000002dc60c perf_try_init_event+0xdc/0x1a0
    c0000000002dce88 perf_event_alloc+0x7b8/0xac0
    c0000000002e92e0 SyS_perf_event_open+0x530/0xda0
    c00000000000b004 system_call+0x38/0xe0

'event_base' field of 'struct hw_perf_event' is used as flags for
normal hw events and used as memory address for imc events. While
grouping these two types of events, collect_events() tries to
interpret imc 'event_base' as a flag, which causes a corruption
resulting in a crash.

Consider only those events which belongs to 'perf_hw_context' in
collect_events().

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Reviewed-By: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-04 16:03:19 +11:00
Santosh Sivaraj 5c929885f1 powerpc/vdso64: Add support for CLOCK_{REALTIME/MONOTONIC}_COARSE
Current vDSO64 implementation does not have support for coarse clocks
(CLOCK_MONOTONIC_COARSE, CLOCK_REALTIME_COARSE), for which it falls back
to system call, increasing the response time, vDSO implementation reduces
the cycle time. Below is a benchmark of the difference in execution times.

(Non-coarse clocks are also included just for completion)

clock-gettime-realtime: syscall: 172 nsec/call
clock-gettime-realtime:    libc: 28 nsec/call
clock-gettime-realtime:    vdso: 22 nsec/call
clock-gettime-monotonic: syscall: 171 nsec/call
clock-gettime-monotonic:    libc: 30 nsec/call
clock-gettime-monotonic:    vdso: 25 nsec/call
clock-gettime-realtime-coarse: syscall: 153 nsec/call
clock-gettime-realtime-coarse:    libc: 16 nsec/call
clock-gettime-realtime-coarse:    vdso: 10 nsec/call
clock-gettime-monotonic-coarse: syscall: 167 nsec/call
clock-gettime-monotonic-coarse:    libc: 17 nsec/call
clock-gettime-monotonic-coarse:    vdso: 11 nsec/call

CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-04 15:01:09 +11:00
Dmitry Torokhov 0aa8ff9b76 powerpc: Use of for_each_node_by_name() instead of open-coding it
Instead of manually coding the loop with of_find_node_by_name(), let's
switch to the standard macro for iterating over nodes with given name.

Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
[mpe: Fix build failures due to typo in mpc832x_mds.c]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-04 13:10:59 +11:00
Dmitry Torokhov df26200299 powerpc/powermac: drop useless call to of_find_node_by_name
We are not using result, so this simply results in a leaked refcount.

Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-04 11:54:34 +11:00
Dmitry Torokhov 2aaf0a198d powerpc/powermac: fix OF node refcount leak
We need to call of_node_put() for device nodes obtained with
of_find_node_by_name().

Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-04 11:54:34 +11:00
Joe Perches f2c2cbcc35 powerpc: Use pr_warn instead of pr_warning
At some point, pr_warning will be removed so all logging messages use
a consistent <prefix>_warn style.

Update arch/powerpc/

Miscellanea:

o Coalesce formats
o Realign arguments
o Use %s, __func__ instead of embedded function names
o Remove unnecessary line continuations

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Geoff Levand <geoff@infradead.org>
[mpe: Rebase due to some %pOF changes.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-04 11:54:34 +11:00
Linus Torvalds a0651c7fa2 powerpc fixes for 4.15 #3
Two fixes for nasty kexec/kdump crashes in certain configurations.
 
 A couple of minor fixes for the new TIDR code.
 
 A fix for an oops in a CXL error handling path.
 
 Thanks to:
   Andrew Donnellan, Christophe Lombard, David Gibson, Mahesh Salgaonkar, Vaibhav Jain.
 -----BEGIN PGP SIGNATURE-----
 
 iQIwBAABCAAaBQJaITuLExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBS
 9A/+IYjSoDdTcJXqfR46xn2147RGBQIym2rqaBJd+WwZj8Br0Yap5hrPtr1zAilD
 75aj0CRR0Y91nodnfishjujsJZckyQYOHi0/WQluLbpRlWEmeQt47gDjz70Wt1T8
 BZUEVqPF2k6Mk5WJV6sSIHBtw2uKrl/lJZAUJbTobOWgsMdopO504MkFxvySWKMV
 AX7UEXrcxPLb/yVGk9Ih9iwXxm/ymvQrkljp4s3jWqkc7bWwN93CmimIQ+X6bop0
 yqmAzCiUJsPsulmkBkmsY78llPg0roUrh98R4JIe0+cUiQROa5Kvt/u0zohN/rqS
 6SkPT0ds2Fs1z5cHayyQWMN0j0A5sfwW2KRMLHCJjAwAxzoT2CdMZDv0+QLi0ETy
 RGtYvnew8eCqrfBpyBneEP1JySARJ85ML4rZvudewSHJoMzTkYDnSEKU8+wlqRIf
 KHdvHmErRMlF7OB6Om3Uxz6oIXan/Puj7HsdL8f7MazjFPqb/r+/AuTDzUov17Fs
 7Y0qVawFyJyAJ8zkUAGB1kN2FN+eYnsFxUa7ubpeJY7VX+8pUOwT24rFc803eAu4
 p/ad1CpBy+8xaq83WeaM6BpMqW80ao2BzzbQUhDcEQl4qovO/ZEZxQt0ySoQp+SY
 MqE8SnZMkL/30CasbKTAqmt+P44GCSYZVtOUwTmvLNMTjSg=
 =/LD/
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "Two fixes for nasty kexec/kdump crashes in certain configurations.

  A couple of minor fixes for the new TIDR code.

  A fix for an oops in a CXL error handling path.

  Thanks to: Andrew Donnellan, Christophe Lombard, David Gibson, Mahesh
  Salgaonkar, Vaibhav Jain"

* tag 'powerpc-4.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc: Do not assign thread.tidr if already assigned
  powerpc: Avoid signed to unsigned conversion in set_thread_tidr()
  powerpc/kexec: Fix kexec/kdump in P9 guest kernels
  powerpc/powernv: Fix kexec crashes caused by tlbie tracing
  cxl: Check if vphb exists before iterating over AFU devices
2017-12-01 08:40:17 -05:00
Pankaj Bansal c8ae67fe02 powerpc: dts: P1010: Add endianness property to flexcan node
The flexcan driver assumed that flexcan controller is big endian for
powerpc architecture and little endian for other architectures.

But this is not universally true. flexcan controller can be little or
big endian on any architecture.

Therefore the flexcan driver has been modified to check for "big-endian"
device tree property for controllers that are big endian.

consequently add the property to freescale P1010 SOC device tree.

Signed-off-by: Pankaj Bansal <pankaj.bansal@nxp.com>
Reviewed-by: Poonam Aggrwal <poonam.aggrwal@nxp.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-12-01 09:14:23 +01:00
Linus Torvalds 9e0600f5cf * x86 bugfixes: APIC, nested virtualization, IOAPIC
* PPC bugfix: HPT guests on a POWER9 radix host
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJaICi1AAoJEL/70l94x66DjvEIAIML/e9YX1YrJZi0rsB9cbm0
 Le3o5b3wKxPrlZdnpOZQ2mVWubUQdiHMPGX6BkpgyiJWUchnbj5ql1gUf5S0i3jk
 TOk6nae6DU94xBuboeqZJlmx2VfPY/fqzLWsX3HFHpnzRl4XvXL5o7cWguIxVcVO
 yU6bPgbAXyXSBennLWZxC3aQ2Ojikr3uxZQpUZTAPOW5hFINpCKCpqJBMxsb67wq
 rwI0cJhRl92mHpbe8qeNJhavqY5eviy9iPUaZrOW9P4yw1uqjTAjgsUc1ydiaZSV
 rOHeKBOgVfY/KBaNJKyKySfuL1MJ+DLcQqm9RlGpKNpFIeB0vvSf0gtmmqIAXIk=
 =kh2y
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:

 - x86 bugfixes: APIC, nested virtualization, IOAPIC

 - PPC bugfix: HPT guests on a POWER9 radix host

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
  KVM: Let KVM_SET_SIGNAL_MASK work as advertised
  KVM: VMX: Fix vmx->nested freeing when no SMI handler
  KVM: VMX: Fix rflags cache during vCPU reset
  KVM: X86: Fix softlockup when get the current kvmclock
  KVM: lapic: Fixup LDR on load in x2apic
  KVM: lapic: Split out x2apic ldr calculation
  KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts
  KVM: vmx: use X86_CR4_UMIP and X86_FEATURE_UMIP
  KVM: x86: Fix CPUID function for word 6 (80000001_ECX)
  KVM: nVMX: Fix vmx_check_nested_events() return value in case an event was reinjected to L2
  KVM: x86: ioapic: Preserve read-only values in the redirection table
  KVM: x86: ioapic: Clear Remote IRR when entry is switched to edge-triggered
  KVM: x86: ioapic: Remove redundant check for Remote IRR in ioapic_set_irq
  KVM: x86: ioapic: Don't fire level irq when Remote IRR set
  KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race
  KVM: x86: inject exceptions produced by x86_decode_insn
  KVM: x86: Allow suppressing prints on RDMSR/WRMSR of unhandled MSRs
  KVM: x86: fix em_fxstor() sleeping while in atomic
  KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure
  KVM: nVMX: Validate the IA32_BNDCFGS on nested VM-entry
  ...
2017-11-30 08:15:19 -08:00
Dan Williams e4e40e0263 mm: switch to 'define pmd_write' instead of __HAVE_ARCH_PMD_WRITE
In response to compile breakage introduced by a series that added the
pud_write helper to x86, Stephen notes:

    did you consider using the other paradigm:

    In arch include files:
    #define pud_write       pud_write
    static inline int pud_write(pud_t pud)
     .....

    Then in include/asm-generic/pgtable.h:

    #ifndef pud_write
    tatic inline int pud_write(pud_t pud)
    {
            ....
    }
    #endif

    If you had, then the powerpc code would have worked ... ;-) and many
    of the other interfaces in include/asm-generic/pgtable.h are
    protected that way ...

Given that some architecture already define pmd_write() as a macro, it's
a net reduction to drop the definition of __HAVE_ARCH_PMD_WRITE.

Link: http://lkml.kernel.org/r/151129126721.37405.13339850900081557813.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:42 -08:00
Vaibhav Jain 7e4d423326 powerpc: Do not assign thread.tidr if already assigned
If set_thread_tidr() is called twice for same task_struct then it will
allocate a new tidr value to it leaving the previous value still
dangling in the vas_thread_ida table.

To fix this the patch changes set_thread_tidr() to check if a tidr
value is already assigned to the task_struct and if yes then returns
zero.

Fixes: ec233ede4c86("powerpc: Add support for setting SPRN_TIDR")
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
[mpe: Modify to return 0 in the success case, not the TID value]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-29 19:56:18 +11:00
Vaibhav Jain aca7573fde powerpc: Avoid signed to unsigned conversion in set_thread_tidr()
There is an unsafe signed to unsigned conversion in set_thread_tidr()
that may cause an error value to be assigned to SPRN_TIDR register and
used as thread-id.

The issue happens as assign_thread_tidr() returns an int and
thread.tidr is an unsigned-long. So a negative error code returned
from assign_thread_tidr() will fail the error check and gets assigned
as tidr as a large positive value.

To fix this the patch assigns the return value of assign_thread_tidr()
to a temporary int and assigns it to thread.tidr iff its '> 0'.

The patch shouldn't impact the calling convention of set_thread_tidr()
i.e all -ve return-values are error codes and a return value of '0'
indicates success.

Fixes: ec233ede4c86("powerpc: Add support for setting SPRN_TIDR")
Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>
Reviewed-by: Christophe Lombard clombard@linux.vnet.ibm.com
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-29 19:36:13 +11:00
Al Viro 8153a5ead0 ppc: annotate ->poll() instances
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27 16:19:59 -05:00
Paolo Bonzini a63dd7480d PPC KVM fixes for 4.15
One commit here, that fixes a couple of bugs relating to the patch
 series that enables HPT guests to run on a radix host on POWER9
 systems.  This patch series went upstream in the 4.15 merge window,
 so no stable backport is required.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJaGKSnAAoJEJ2a6ncsY3GfF7MIANLLhznEMrWq8jw4g95WsJU1
 MkDGwp8kIdhIOM9HD6JRskoJZB5Mws2BWlQ5PSaVFxO6v6eUgNLaRb/UBxC1r7gU
 1f9/8corY4BNkezSdJqTL7Xgp13KjTU726OwYAqCPEyCSPEc9ciMyeIgyZuv2dPa
 Pju+u4tnA+9JJyskgNL+/ybOOZwVat91VmNUVRq29zP6+zo1tmIDxrQchy6Bqui/
 7Wg298G+yjAkJ8ktQu69ACk+0oEBGUOcLUlraqGSr9auR+b0nJ1PAGCDRaONdwgE
 +X+OE+t+UC6rU+coUXMwO+Id0X7HMdsLQd3066ODEtD55g8MIVZ126Wt8xDmj5o=
 =GSTh
 -----END PGP SIGNATURE-----

Merge tag 'kvm-ppc-fixes-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master

PPC KVM fixes for 4.15

One commit here, that fixes a couple of bugs relating to the patch
series that enables HPT guests to run on a radix host on POWER9
systems.  This patch series went upstream in the 4.15 merge window,
so no stable backport is required.
2017-11-27 17:54:13 +01:00
Jan H. Schönherr 20b7035c66 KVM: Let KVM_SET_SIGNAL_MASK work as advertised
KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
"any unblocked signal received [...] will cause KVM_RUN to return with
-EINTR" and that "the signal will only be delivered if not blocked by
the original signal mask".

This, however, is only true, when the calling task has a signal handler
registered for a signal. If not, signal evaluation is short-circuited for
SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
returning or the whole process is terminated.

Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
to that in do_sigtimedwait() to avoid short-circuiting of signals.

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-11-27 17:53:47 +01:00
Linus Torvalds 844056fd74 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:

 - The final conversion of timer wheel timers to timer_setup().

   A few manual conversions and a large coccinelle assisted sweep and
   the removal of the old initialization mechanisms and the related
   code.

 - Remove the now unused VSYSCALL update code

 - Fix permissions of /proc/timer_list. I still need to get rid of that
   file completely

 - Rename a misnomed clocksource function and remove a stale declaration

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  m68k/macboing: Fix missed timer callback assignment
  treewide: Remove TIMER_FUNC_TYPE and TIMER_DATA_TYPE casts
  timer: Remove redundant __setup_timer*() macros
  timer: Pass function down to initialization routines
  timer: Remove unused data arguments from macros
  timer: Switch callback prototype to take struct timer_list * argument
  timer: Pass timer_list pointer to callbacks unconditionally
  Coccinelle: Remove setup_timer.cocci
  timer: Remove setup_*timer() interface
  timer: Remove init_timer() interface
  treewide: setup_timer() -> timer_setup() (2 field)
  treewide: setup_timer() -> timer_setup()
  treewide: init_timer() -> setup_timer()
  treewide: Switch DEFINE_TIMER callbacks to struct timer_list *
  s390: cmm: Convert timers to use timer_setup()
  lightnvm: Convert timers to use timer_setup()
  drivers/net: cris: Convert timers to use timer_setup()
  drm/vc4: Convert timers to use timer_setup()
  block/laptop_mode: Convert timers to use timer_setup()
  net/atm/mpc: Avoid open-coded assignment of timer callback function
  ...
2017-11-25 08:37:16 -10:00
Linus Torvalds 83ada03196 powerpc fixes for 4.15 #2
A small batch of fixes, about 50% tagged for stable and the rest for recently
 merged code.
 
 There's one more fix for the >128T handling on hash. Once a process had
 requested a single mmap above 128T we would then always search above 128T. The
 correct behaviour is to consider the hint address in isolation for each mmap
 request.
 
 Then a couple of fixes for the IMC PMU, a missing EXPORT_SYMBOL in VAS, a fix
 for STRICT_KERNEL_RWX on 32-bit, and a fix to correctly identify P9 DD2.1 but in
 code that is currently not used by default.
 
 Thanks to:
   Aneesh Kumar K.V, Christophe Leroy, Madhavan Srinivasan, Sukadev Bhattiprolu.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJaF/VqAAoJEFHr6jzI4aWA994P/3NNXkSASJHjLrIlQAKXtmx9
 lrv1v+6MbPWhyB8Q8LVnnC3Ab2LTHnkccjq2Jw0bP0RQ86HF4mH7Sb7N5Wj0cG+M
 5NioikvGE057ncLfxVhesOK0C9Lhc7Zb+zphXZliYP76IGxwbxorJRepeZctVkyO
 KPMv4eaImdblVn71aoQQSlepON4+/rtiW2yo5u98uCqR+Ttds4J1fiDZ4TNrBYRP
 Ilh6DmA//CWvN+KsGT+brRd/PjEkxQKHyS8px3lxRl4cwCJucXPCik/Gn9t6OiMw
 3S6y1Mu8nrh4z+YepKv6APvl2DEwwXn8w9f85kn+QiE9Qp3Z/wckW9/4LT5FeuKE
 L8E3dKq2NzJ9oDs/20sVbBvVR7CUvBoyWytsXVkmmlC6sVReTrYAJ1UP9HnNvcF6
 be4zYUKusU83uG6saGgchRrPUrD31XKXw8Piv9EoWo1Uz7VgWCkxidclRNocgeDO
 k5VxYnRd9jPsv2pCzXH2YmuQAypGUh12IPTxEOnSt5uzXSXcamZJBLKp5fAJ/9dl
 jD6GlRQMX8JpNRJzxOBLly3CmwQBw2ekOuPLXI+M/ilks66AGK8lp4bg5cWwDGNe
 puzmRJ2mO3dnFlVUHBQ5LyX8ne4yunin1JZB1YQ4xm8yxZbGO2AdypEWMSkPKNPN
 fkrGPlwQ1JwFheMbHHLj
 =gv70
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
 "A small batch of fixes, about 50% tagged for stable and the rest for
  recently merged code.

  There's one more fix for the >128T handling on hash. Once a process
  had requested a single mmap above 128T we would then always search
  above 128T. The correct behaviour is to consider the hint address in
  isolation for each mmap request.

  Then a couple of fixes for the IMC PMU, a missing EXPORT_SYMBOL in
  VAS, a fix for STRICT_KERNEL_RWX on 32-bit, and a fix to correctly
  identify P9 DD2.1 but in code that is currently not used by default.

  Thanks to: Aneesh Kumar K.V, Christophe Leroy, Madhavan Srinivasan,
  Sukadev Bhattiprolu"

* tag 'powerpc-4.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Fix Power9 DD2.1 logic in DT CPU features
  powerpc/perf: Fix IMC_MAX_PMU macro
  powerpc/perf: Fix pmu_count to count only nest imc pmus
  powerpc: Fix boot on BOOK3S_32 with CONFIG_STRICT_KERNEL_RWX
  powerpc/perf/imc: Use cpu_to_node() not topology_physical_package_id()
  powerpc/vas: Export chip_to_vas_id()
  powerpc/64s/slice: Use addr limit when computing slice mask
2017-11-24 19:40:12 -10:00
Michael Ellerman 2621e945fb powerpc/kexec: Fix kexec/kdump in P9 guest kernels
The code that cleans up the IAMR/AMOR before kexec'ing failed to
remember that when we're running as a guest AMOR is not writable, it's
hypervisor privileged.

They symptom is that the kexec stops before entering purgatory and
nothing else is seen on the console. If you examine the state of the
system all threads will be in the 0x700 program check handler.

Fix it by making the write to AMOR dependent on HV mode.

Fixes: 1e2a516e89 ("powerpc/kexec: Fix radix to hash kexec due to IAMR/AMOR")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Yilin Zhang <yilzhang@redhat.com>
Debugged-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-24 16:49:37 +11:00
Mahesh Salgaonkar a3961f824c powerpc/powernv: Fix kexec crashes caused by tlbie tracing
Rebooting into a new kernel with kexec fails in trace_tlbie() which is
called from native_hpte_clear(). This happens if the running kernel
has CONFIG_LOCKDEP enabled. With lockdep enabled, the tracepoints
always execute few RCU checks regardless of whether tracing is on or
off. We are already in the last phase of kexec sequence in real mode
with HILE_BE set. At this point the RCU check ends up in
RCU_LOCKDEP_WARN and causes kexec to fail.

Fix this by not calling trace_tlbie() from native_hpte_clear().

mpe: It's not safe to call trace points at this point in the kexec
path, even if we could avoid the RCU checks/warnings. The only
solution is to not call them.

Fixes: 0428491cba ("powerpc/mm: Trace tlbie(l) instructions")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-23 23:10:14 +11:00
Paul Mackerras 9aa6825bbb KVM: PPC: Book3S: Eliminate some unnecessary checks
In an excess of caution, commit 6f63e81bda ("KVM: PPC: Book3S: Add
MMIO emulation for FP and VSX instructions", 2017-02-21) included
checks for the case that vcpu->arch.mmio_vsx_copy_nums is less than
zero, even though its type is u8.  This causes a Coverity warning,
so we remove the check for < 0.  We also adjust the associated
comment to be more accurate ("4 or less" rather than "less than 4").

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:34 +11:00
Paul Mackerras c0093f1a38 KVM: PPC: Book3S HV: Fix conditions for starting vcpu
This corrects the test that determines whether a vcpu that has just
become able to run in the guest (e.g. it has just finished handling
a hypercall or hypervisor page fault) and whose virtual core is
already running somewhere as a "piggybacked" vcore can start
immediately or not.  (A piggybacked vcore is one which is executing
along with another vcore as a result of dynamic micro-threading.)

Previously the test tried to lock the piggybacked vcore using
spin_trylock, which would always fail because the vcore was already
locked, and so the vcpu would have to wait until its vcore exited
the guest before it could enter.

In fact the vcpu can enter if its vcore is in VCORE_PIGGYBACK state
and not already exiting (or exited) the guest, so the test in
VCORE_PIGGYBACK state is basically the same as for VCORE_RUNNING
state.

Coverity detected this as a double unlock issue, which it isn't
because the spin_trylock would always fail.  This will fix the
apparent double unlock as well.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:22 +11:00
Paul Mackerras 4fcf361dbd KVM: PPC: Book3S HV: Remove useless statement
This removes a statement that has no effect.  It should have been
removed in commit 898b25b202 ("KVM: PPC: Book3S HV: Simplify dynamic
micro-threading code", 2017-06-22) along with the loop over the
piggy-backed virtual cores.

This issue was reported by Coverity.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:14 +11:00
Paul Mackerras 117647ff93 KVM: PPC: Book3S HV: Fix typo in kvmppc_hv_get_dirty_log_radix()
This fixes a typo where the intent was to assign to 'j' in order to
skip some number of bits in the dirty bitmap for a guest.  The effect
of the typo is benign since it means we just iterate through all the
bits rather than skipping bits which we know will be zero.  This issue
was found by Coverity.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:08 +11:00
Paul Mackerras cda2eaa359 KVM: PPC: Book3S HV: Avoid shifts by negative amounts
The kvmppc_hpte_page_shifts function decodes the actual and base page
sizes for a HPTE, returning -1 if it doesn't recognize the page size
encoding.  This then gets used as a shift amount in various places,
which is undefined behaviour.  This was reported by Coverity.

In fact this should never occur, since we should only get HPTEs in the
HPT which have a recognized page size encoding.  The only place where
this might not be true is in the call to kvmppc_actual_pgsz() near the
beginning of kvmppc_do_h_enter(), where we are validating the HPTE
value passed in from the guest.

So to fix this and eliminate the undefined behaviour, we make
kvmppc_hpte_page_shifts return 0 for unrecognized page size encodings,
and make kvmppc_actual_pgsz() detect that case and return 0 for the
page size, which will then cause kvmppc_do_h_enter() to return an
error and refuse to insert any HPTE with an unrecognized page size
encoding.

To ensure that we don't get undefined behaviour in compute_tlbie_rb(),
we take the 4k page size path for any unrecognized page size encoding.
This should never be hit in practice because it is only used on HPTE
values which have previously been checked for having a recognized
page size encoding.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:23:00 +11:00
Paul Mackerras ded13fc11b KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts
This fixes two errors that prevent a guest using the HPT MMU from
successfully migrating to a POWER9 host in radix MMU mode, or resizing
its HPT when running on a radix host.

The first bug was that commit 8dc6cca556 ("KVM: PPC: Book3S HV:
Don't rely on host's page size information", 2017-09-11) missed two
uses of hpte_base_page_size(), one in the HPT rehashing code and
one in kvm_htab_write() (which is used on the destination side in
migrating a HPT guest).  Instead we use kvmppc_hpte_base_page_shift().
Having the shift count means that we can use left and right shifts
instead of multiplication and division in a few places.

Along the way, this adds a check in kvm_htab_write() to ensure that the
page size encoding in the incoming HPTEs is recognized, and if not
return an EINVAL error to userspace.

The second bug was that kvm_htab_write was performing some but not all
of the functions of kvmhv_setup_mmu(), resulting in the destination VM
being left in radix mode as far as the hardware is concerned.  The
simplest fix for now is make kvm_htab_write() call
kvmppc_setup_partition_table() like kvmppc_hv_setup_htab_rma() does.
In future it would be better to refactor the code more extensively
to remove the duplication.

Fixes: 8dc6cca556 ("KVM: PPC: Book3S HV: Don't rely on host's page size information")
Fixes: 7a84084c60 ("KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9")
Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-11-23 14:17:54 +11:00
Michael Ellerman 4d6c51b107 powerpc/64s: Fix Power9 DD2.1 logic in DT CPU features
I got the logic wrong in the DT CPU features code when I added the
Power9 DD2.1 feature. We should be setting the bit if we detect a
DD2.1, not clearing it if we detect a DD2.0.

This code isn't actually exercised at the moment so nothing is
actually broken.

Fixes: 3ffa9d9e2a ("powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-22 23:17:01 +11:00
Madhavan Srinivasan 73ce9aec65 powerpc/perf: Fix IMC_MAX_PMU macro
IMC_MAX_PMU is used for static storage (per_nest_pmu_arr) which holds
nest pmu information. Current value for the macro is 32 based on
the initial number of nest pmu units supported by the nest microcode.
But going forward, microcode could support more nest units. Instead
of static storage, patch to fix the code to dynamically allocate an
array based on the number of nest imc units found in the device tree.

Fixes:8f95faaac56c1 ('powerpc/powernv: Detect and create IMC device')
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-22 23:04:23 +11:00
Madhavan Srinivasan de34787f10 powerpc/perf: Fix pmu_count to count only nest imc pmus
"pmu_count" in opal_imc_counters_probe() is intended to hold
the number of successful nest imc pmu registerations. But
current code also counts other imc units like core_imc and
thread_imc. Patch add a check to count only nest imc pmus.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-22 23:04:23 +11:00
Christophe Leroy 252eb55816 powerpc: Fix boot on BOOK3S_32 with CONFIG_STRICT_KERNEL_RWX
On powerpc32, patch_instruction() is called by apply_feature_fixups()
which is called from early_init()

There is the following note in front of early_init():
 * Note that the kernel may be running at an address which is different
 * from the address that it was linked at, so we must use RELOC/PTRRELOC
 * to access static data (including strings).  -- paulus

Therefore, slab_is_available() cannot be called yet, and
text_poke_area must be addressed with PTRRELOC()

Fixes: 95902e6c88 ("powerpc/mm: Implement STRICT_KERNEL_RWX on PPC32")
Cc: stable@vger.kernel.org # v4.14+
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-22 23:04:20 +11:00
Michael Ellerman f3f1dfd600 powerpc/perf/imc: Use cpu_to_node() not topology_physical_package_id()
init_imc_pmu() uses topology_physical_package_id() to detect the
node id of the processor it is on to get local memory, but that's
wrong, and can lead to crashes. Fix it to use cpu_to_node().

Fixes: 885dcd709b ("powerpc/perf: Add nest IMC PMU support")
Cc: stable@vger.kernel.org # v4.14+
Reported-By: Rob Lippert <rlippert@google.com>
Tested-By: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-22 12:24:46 +11:00
Kees Cook 86cb30ec07 treewide: setup_timer() -> timer_setup() (2 field)
This converts all remaining setup_timer() calls that use a nested field
to reach a struct timer_list. Coccinelle does not have an easy way to
match multiple fields, so a new script is needed to change the matches of
"&_E->_timer" into "&_E->_field1._timer" in all the rules.

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/timer_setup-2fields.cocci

@fix_address_of depends@
expression e;
@@

 setup_timer(
-&(e)
+&e
 , ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _field1;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_field1._timer, NULL, _E);
+timer_setup(&_E->_field1._timer, NULL, 0);
|
-setup_timer(&_E->_field1._timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, NULL, 0);
|
-setup_timer(&_E._field1._timer, NULL, &_E);
+timer_setup(&_E._field1._timer, NULL, 0);
|
-setup_timer(&_E._field1._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _field1;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_field1._timer, _callback, _E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, &_callback, _E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._field1._timer, _callback, 0);
|
 _E->_field1._timer@_stl.function = _callback;
|
 _E->_field1._timer@_stl.function = &_callback;
|
 _E->_field1._timer@_stl.function = (_cast_func)_callback;
|
 _E->_field1._timer@_stl.function = (_cast_func)&_callback;
|
 _E._field1._timer@_stl.function = _callback;
|
 _E._field1._timer@_stl.function = &_callback;
|
 _E._field1._timer@_stl.function = (_cast_func)_callback;
|
 _E._field1._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
 depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
(
	... when != _origarg
	_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _field1._timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _field1._timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _field1._timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(void *)_origarg;
+from_timer(_handle, t, _field1._timer);
	... when != _origarg
)
 }

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
 depends on change_timer_function_usage &&
                     !change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
+	_handletype *_origarg = from_timer(_origarg, t, _field1._timer);
+
	... when != _origarg
-	(_handletype *)_origarg
+	_origarg
	... when != _origarg
 }

// Avoid already converted callbacks.
@match_callback_converted
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
	    !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

 void _callback(struct timer_list *t)
 { ... }

// callback(struct something *handle)
@change_callback_handle_arg
 depends on change_timer_function_usage &&
	    !match_callback_converted &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

 void _callback(
-_handletype *_handle
+struct timer_list *t
 )
 {
+	_handletype *_handle = from_timer(_handle, t, _field1._timer);
	...
 }

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
 depends on change_timer_function_usage &&
	    change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

 void _callback(struct timer_list *t)
 {
-	_handletype *_handle = from_timer(_handle, t, _field1._timer);
 }

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg &&
	    !change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_field1._timer, _callback, 0);
+setup_timer(&_E->_field1._timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._field1._timer, _callback, 0);
+setup_timer(&_E._field1._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
 _E->_field1._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_field1._timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_field1._timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_field1._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._field1._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._field1._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._field1._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._field1._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._field1;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

 _callback(
(
-(_cast_data)_E
+&_E->_field1._timer
|
-(_cast_data)&_E
+&_E._field1._timer
|
-_E
+&_E->_field1._timer
)
 )

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _field1;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_field1._timer, _callback, 0);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, _callback, 0L);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E->_field1._timer, _callback, 0UL);
+timer_setup(&_E->_field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, 0);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, 0L);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_E._field1._timer, _callback, 0UL);
+timer_setup(&_E._field1._timer, _callback, 0);
|
-setup_timer(&_field1._timer, _callback, 0);
+timer_setup(&_field1._timer, _callback, 0);
|
-setup_timer(&_field1._timer, _callback, 0L);
+timer_setup(&_field1._timer, _callback, 0);
|
-setup_timer(&_field1._timer, _callback, 0UL);
+timer_setup(&_field1._timer, _callback, 0);
|
-setup_timer(_field1._timer, _callback, 0);
+timer_setup(_field1._timer, _callback, 0);
|
-setup_timer(_field1._timer, _callback, 0L);
+timer_setup(_field1._timer, _callback, 0);
|
-setup_timer(_field1._timer, _callback, 0UL);
+timer_setup(_field1._timer, _callback, 0);
)

@change_callback_unused_data
 depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *unused
 )
 {
	... when != _origarg
 }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:09 -08:00
Kees Cook e99e88a9d2 treewide: setup_timer() -> timer_setup()
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.

Casting from unsigned long:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

and forced object casts:

    void my_callback(struct something *ptr)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

become:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

Direct function assignments:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    ptr->my_timer.function = my_callback;

have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

The conversion is done with the following Coccinelle script:

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/timer_setup.cocci

@fix_address_of@
expression e;
@@

 setup_timer(
-&(e)
+&e
 , ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
 _E->_timer@_stl.function = _callback;
|
 _E->_timer@_stl.function = &_callback;
|
 _E->_timer@_stl.function = (_cast_func)_callback;
|
 _E->_timer@_stl.function = (_cast_func)&_callback;
|
 _E._timer@_stl.function = _callback;
|
 _E._timer@_stl.function = &_callback;
|
 _E._timer@_stl.function = (_cast_func)_callback;
|
 _E._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
 depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
(
	... when != _origarg
	_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
)
 }

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
 depends on change_timer_function_usage &&
                     !change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
+	_handletype *_origarg = from_timer(_origarg, t, _timer);
+
	... when != _origarg
-	(_handletype *)_origarg
+	_origarg
	... when != _origarg
 }

// Avoid already converted callbacks.
@match_callback_converted
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
	    !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

 void _callback(struct timer_list *t)
 { ... }

// callback(struct something *handle)
@change_callback_handle_arg
 depends on change_timer_function_usage &&
	    !match_callback_converted &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

 void _callback(
-_handletype *_handle
+struct timer_list *t
 )
 {
+	_handletype *_handle = from_timer(_handle, t, _timer);
	...
 }

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
 depends on change_timer_function_usage &&
	    change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

 void _callback(struct timer_list *t)
 {
-	_handletype *_handle = from_timer(_handle, t, _timer);
 }

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg &&
	    !change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
 _E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-&_callback;
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

 _callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
 )

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)

@change_callback_unused_data
 depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *unused
 )
 {
	... when != _origarg
 }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:07 -08:00
Sukadev Bhattiprolu 62b49c4210 powerpc/vas: Export chip_to_vas_id()
Export the symbol chip_to_vas_id() to fix a build failure when
CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV=m.

Fixes: d4ef61b5e8 ("powerpc/vas, nx-842: Define and use chip_to_vas_id()")
Reported-by: Haren Myneni <hbabu@us.ibm.com>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-21 21:02:26 +11:00
Aneesh Kumar K.V 7a06c66835 powerpc/64s/slice: Use addr limit when computing slice mask
While computing slice mask for the free area we need make sure we only
search in the addr limit applicable for this mmap. We update the
slb_addr_limit after we request for a mmap above 128TB. But the
following mmap request with hint addr below 128TB should still limit
its search to below 128TB. ie. we should not use slb_addr_limit to
compute slice mask in this case. Instead, we should derive high addr
limit based on the mmap hint addr value.

Fixes: f4ea6dcb08 ("powerpc/mm: Enable mappings above 128TB")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-20 19:28:25 +11:00
Linus Torvalds fa7f578076 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - a bit more MM

 - procfs updates

 - dynamic-debug fixes

 - lib/ updates

 - checkpatch

 - epoll

 - nilfs2

 - signals

 - rapidio

 - PID management cleanup and optimization

 - kcov updates

 - sysvipc updates

 - quite a few misc things all over the place

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
  EXPERT Kconfig menu: fix broken EXPERT menu
  include/asm-generic/topology.h: remove unused parent_node() macro
  arch/tile/include/asm/topology.h: remove unused parent_node() macro
  arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro
  arch/sh/include/asm/topology.h: remove unused parent_node() macro
  arch/ia64/include/asm/topology.h: remove unused parent_node() macro
  drivers/pcmcia/sa1111_badge4.c: avoid unused function warning
  mm: add infrastructure for get_user_pages_fast() benchmarking
  sysvipc: make get_maxid O(1) again
  sysvipc: properly name ipc_addid() limit parameter
  sysvipc: duplicate lock comments wrt ipc_addid()
  sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE
  initramfs: use time64_t timestamps
  drivers/watchdog: make use of devm_register_reboot_notifier()
  kernel/reboot.c: add devm_register_reboot_notifier()
  kcov: update documentation
  Makefile: support flag -fsanitizer-coverage=trace-cmp
  kcov: support comparison operands collection
  kcov: remove pointless current != NULL check
  kernel/panic.c: add TAINT_AUX
  ...
2017-11-17 16:56:17 -08:00
Gargi Sharma 95846ecf9d pid: replace pid bitmap implementation with IDR API
Patch series "Replacing PID bitmap implementation with IDR API", v4.

This series replaces kernel bitmap implementation of PID allocation with
IDR API.  These patches are written to simplify the kernel by replacing
custom code with calls to generic code.

The following are the stats for pid and pid_namespace object files
before and after the replacement.  There is a noteworthy change between
the IDR and bitmap implementation.

Before
   text       data        bss        dec        hex    filename
   8447       3894         64      12405       3075    kernel/pid.o
After
   text       data        bss        dec        hex    filename
   3397        304          0       3701        e75    kernel/pid.o

Before
   text       data        bss        dec        hex    filename
   5692       1842        192       7726       1e2e    kernel/pid_namespace.o
After
   text       data        bss        dec        hex    filename
   2854        216         16       3086        c0e    kernel/pid_namespace.o

The following are the stats for ps, pstree and calling readdir on /proc
for 10,000 processes.

ps:
        With IDR API    With bitmap
real    0m1.479s        0m2.319s
user    0m0.070s        0m0.060s
sys     0m0.289s        0m0.516s

pstree:
        With IDR API    With bitmap
real    0m1.024s        0m1.794s
user    0m0.348s        0m0.612s
sys     0m0.184s        0m0.264s

proc:
        With IDR API    With bitmap
real    0m0.059s        0m0.074s
user    0m0.000s        0m0.004s
sys     0m0.016s        0m0.016s

This patch (of 2):

Replace the current bitmap implementation for Process ID allocation.
Functions that are no longer required, for example, free_pidmap(),
alloc_pidmap(), etc.  are removed.  The rest of the functions are
modified to use the IDR API.  The change was made to make the PID
allocation less complex by replacing custom code with calls to generic
API.

[gs051095@gmail.com: v6]
  Link: http://lkml.kernel.org/r/1507760379-21662-2-git-send-email-gs051095@gmail.com
[avagin@openvz.org: restore the old behaviour of the ns_last_pid sysctl]
  Link: http://lkml.kernel.org/r/20171106183144.16368-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/1507583624-22146-2-git-send-email-gs051095@gmail.com
Signed-off-by: Gargi Sharma <gs051095@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Linus Torvalds 5a3e0b196b File locking related changes for v4.15
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaDuoWAAoJEAAOaEEZVoIVXEQP/jQYoU9hgvEj8j3ZIgi56SDJ
 pR45w2zcJz2/uU43DEKyShyLgsuoBbJQ3l/gGBH/tl+xGm9NzB0gatoEu9GmKNYz
 /IN6/vUFnoIAUyD+iMZbpmsYKIkz0z2YJo261IfspAwIft/cvHJnYYGQrP9YXg9F
 c7bdDuANTKocdQigc4BQyOe3OfIBGfTwJhuakO+1yuZmGOVNyxEcdYbMM8FiTfc8
 +62kvQQ3t7WMqSbM8M0QdGcYQjG0EwcVAuV7COurLJIva7hUkVel32MVUjoFcf28
 BnRu2ztFJCubm1HA85twlJDtpeXbcMqrUl/CcwRMpwDaePd5GVB1h5iKqbZ51BZ1
 fWT2STmt+8hY2B5eiXoYEaG3B7ZRr+r0oroxqOxpiZ/m4AVeouF+gPGv+NV5zgvD
 NGWC0MdklIJ4xaC99NEeP6kBhz0M74VKymFCTeHkVg9m4TqDepNvitKed0qagw19
 uw8seei7TOTm4o117+l55NHmyfTHXFO4U0WLTJyeZcoEnUs0rOcHeqyy0RwCBMrK
 W2fJtdBLFr+tBIIrID4TnPhhYtSvIPjz+FpiRDobqhgvMva/PIvLGTWK4unrgIjG
 ZQ7YGnwWda8GjqKhgZacn/BSXyJzOAF9hJp0mz2ORaOxaMarEV55duiZufCvGuZw
 uUQWRCKuQX7Oi05i9jXp
 =fCeF
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking update from Jeff Layton:
 "A couple of fixes for a patch that went into v4.14, and the bug report
  just came in a few days ago.. It passes my (minimal) testing, and has
  been in linux-next for a few days now.

  I also would like to get my address changed in MAINTAINERS to clear
  that hurdle"

* tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall
  fcntl: don't leak fd reference when fixup_compat_flock fails
  MAINTAINERS: s/jlayton@poochiereds.net/jlayton@kernel.org/
2017-11-17 13:21:58 -08:00
Linus Torvalds 93f30c73ec Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat and uaccess updates from Al Viro:

 - {get,put}_compat_sigset() series

 - assorted compat ioctl stuff

 - more set_fs() elimination

 - a few more timespec64 conversions

 - several removals of pointless access_ok() in places where it was
   followed only by non-__ variants of primitives

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
  coredump: call do_unlinkat directly instead of sys_unlink
  fs: expose do_unlinkat for built-in callers
  ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
  ipmi: get rid of pointless access_ok()
  pi433: sanitize ioctl
  cxlflash: get rid of pointless access_ok()
  mtdchar: get rid of pointless access_ok()
  r128: switch compat ioctls to drm_ioctl_kernel()
  selection: get rid of field-by-field copyin
  VT_RESIZEX: get rid of field-by-field copyin
  i2c compat ioctls: move to ->compat_ioctl()
  sched_rr_get_interval(): move compat to native, get rid of set_fs()
  mips: switch to {get,put}_compat_sigset()
  sparc: switch to {get,put}_compat_sigset()
  s390: switch to {get,put}_compat_sigset()
  ppc: switch to {get,put}_compat_sigset()
  parisc: switch to {get,put}_compat_sigset()
  get_compat_sigset()
  get rid of {get,put}_compat_itimerspec()
  io_getevents: Use timespec64 to represent timeouts
  ...
2017-11-17 11:54:55 -08:00
Linus Torvalds 974aa5630b First batch of KVM changes for 4.15
Common:
  - Python 3 support in kvm_stat
 
  - Accounting of slabs to kmemcg
 
 ARM:
  - Optimized arch timer handling for KVM/ARM
 
  - Improvements to the VGIC ITS code and introduction of an ITS reset
    ioctl
 
  - Unification of the 32-bit fault injection logic
 
  - More exact external abort matching logic
 
 PPC:
  - Support for running hashed page table (HPT) MMU mode on a host that
    is using the radix MMU mode;  single threaded mode on POWER 9 is
    added as a pre-requisite
 
  - Resolution of merge conflicts with the last second 4.14 HPT fixes
 
  - Fixes and cleanups
 
 s390:
  - Some initial preparation patches for exitless interrupts and crypto
 
  - New capability for AIS migration
 
  - Fixes
 
 x86:
  - Improved emulation of LAPIC timer mode changes, MCi_STATUS MSRs, and
    after-reset state
 
  - Refined dependencies for VMX features
 
  - Fixes for nested SMI injection
 
  - A lot of cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJaDayXAAoJEED/6hsPKofo/3UH/3HvlcHt+ADTkCU1/iiKAs+i
 0zngIOXIxgHDnV0ww6bV+Znww0BzTYgKCAXX76z603jdpDwG/pzQQcbLDF5ZoJnD
 sQtF10gZinWaRsHlfbLqjrHGL2pGDHO1UKBKLJ0bAIyORPZBxs7i+VmrY/blnr9c
 0wsybJ8RbvwAxjsDL5jeX/z4NehPupmKUc4Lf0eZdSHwVOf9sjn+MP6jJ0r2JcIb
 D+zddPBiLStzN97t4gZpQsrlj3LKrDS+6hY+1TjSvlh+yHKFVFh58VhLm4DuDeb5
 bYOAlWJ/gAWEzfvr5Ld+Nd7SqWWn/14logPkQ4gcU4BI/neAOzk4c6hJfCHl1nk=
 =593n
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "First batch of KVM changes for 4.15

  Common:
   - Python 3 support in kvm_stat
   - Accounting of slabs to kmemcg

  ARM:
   - Optimized arch timer handling for KVM/ARM
   - Improvements to the VGIC ITS code and introduction of an ITS reset
     ioctl
   - Unification of the 32-bit fault injection logic
   - More exact external abort matching logic

  PPC:
   - Support for running hashed page table (HPT) MMU mode on a host that
     is using the radix MMU mode; single threaded mode on POWER 9 is
     added as a pre-requisite
   - Resolution of merge conflicts with the last second 4.14 HPT fixes
   - Fixes and cleanups

  s390:
   - Some initial preparation patches for exitless interrupts and crypto
   - New capability for AIS migration
   - Fixes

  x86:
   - Improved emulation of LAPIC timer mode changes, MCi_STATUS MSRs,
     and after-reset state
   - Refined dependencies for VMX features
   - Fixes for nested SMI injection
   - A lot of cleanups"

* tag 'kvm-4.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (89 commits)
  KVM: s390: provide a capability for AIS state migration
  KVM: s390: clear_io_irq() requests are not expected for adapter interrupts
  KVM: s390: abstract conversion between isc and enum irq_types
  KVM: s390: vsie: use common code functions for pinning
  KVM: s390: SIE considerations for AP Queue virtualization
  KVM: s390: document memory ordering for kvm_s390_vcpu_wakeup
  KVM: PPC: Book3S HV: Cosmetic post-merge cleanups
  KVM: arm/arm64: fix the incompatible matching for external abort
  KVM: arm/arm64: Unify 32bit fault injection
  KVM: arm/arm64: vgic-its: Implement KVM_DEV_ARM_ITS_CTRL_RESET
  KVM: arm/arm64: Document KVM_DEV_ARM_ITS_CTRL_RESET
  KVM: arm/arm64: vgic-its: Free caches when GITS_BASER Valid bit is cleared
  KVM: arm/arm64: vgic-its: New helper functions to free the caches
  KVM: arm/arm64: vgic-its: Remove kvm_its_unmap_device
  arm/arm64: KVM: Load the timer state when enabling the timer
  KVM: arm/arm64: Rework kvm_timer_should_fire
  KVM: arm/arm64: Get rid of kvm_timer_flush_hwstate
  KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit
  KVM: arm/arm64: Move phys_timer_emulate function
  KVM: arm/arm64: Use kvm_arm_timer_set/get_reg for guest register traps
  ...
2017-11-16 13:00:24 -08:00
Linus Torvalds 5b0e2cb020 powerpc updates for 4.15
Non-highlights:
 
  - Five fixes for the >128T address space handling, both to fix bugs in our
    implementation and to bring the semantics exactly into line with x86.
 
 Highlights:
 
  - Support for a new OPAL call on bare metal machines which gives us a true NMI
    (ie. is not masked by MSR[EE]=0) for debugging etc.
 
  - Support for Power9 DD2 in the CXL driver.
 
  - Improvements to machine check handling so that uncorrectable errors can be
    reported into the generic memory_failure() machinery.
 
  - Some fixes and improvements for VPHN, which is used under PowerVM to notify
    the Linux partition of topology changes.
 
  - Plumbing to enable TM (transactional memory) without suspend on some Power9
    processors (PPC_FEATURE2_HTM_NO_SUSPEND).
 
  - Support for emulating vector loads form cache-inhibited memory, on some
    Power9 revisions.
 
  - Disable the fast-endian switch "syscall" by default (behind a CONFIG), we
    believe it has never had any users.
 
  - A major rework of the API drivers use when initiating and waiting for long
    running operations performed by OPAL firmware, and changes to the
    powernv_flash driver to use the new API.
 
  - Several fixes for the handling of FP/VMX/VSX while processes are using
    transactional memory.
 
  - Optimisations of TLB range flushes when using the radix MMU on Power9.
 
  - Improvements to the VAS facility used to access coprocessors on Power9, and
    related improvements to the way the NX crypto driver handles requests.
 
  - Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit.
 
 Thanks to:
   Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew Donnellan, Aneesh
   Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin Herrenschmidt, Breno Leitao,
   Christophe Leroy, Christophe Lombard, Cyril Bur, Frederic Barrat, Gautham R.
   Shenoy, Geert Uytterhoeven, Guilherme G. Piccoli, Gustavo Romero, Haren
   Myneni, Joel Stanley, Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami
   Hiramatsu, Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao,
   Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia Franco de
   Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee, Shriya, Stephen
   Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel Datwyler, Vaibhav Jain,
   Vaidyanathan Srinivasan, William A. Kennington III.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJaDXGuAAoJEFHr6jzI4aWAEqwP/0TA35KFAK6wqfkCf67z4q+O
 I+5piI4eDV4jdCakfoIN1JfjhQRULNePSoCHTccan30mu/bm30p69xtOLL2/h5xH
 Mhz/eDBAOo0lrT20nyZfYMW3FnM66wnNf++qJ0O+8L052r4WOB02J0k1uM1ST01D
 5Lb5mUoxRLRzCgKRYAYWJifn+IFPUB9NMsvMTym94krAFlIjIzMEQXhDoln+jJMr
 QmY5f1BTA/fLfXobn0zwoc/C1oa2PUtxd+rxbwGrLoZ6G843mMqUi90SMr5ybhXp
 RzepnBTj4by3vOsnk/X1mANyaZfLsunp75FwnjHdPzKrAS/TuPp8D/iSxxE/PzEq
 cLwJFBnFXSgQMefDErXxhHSDz2dAg5r14rsTpDcq2Ko8TPV4rPsuSfmbd9Txekb0
 yWHsjoJUBBMl2QcWqIHl+AlV8j1RklF6solcTBcGnH1CZJMfa05VKXV7xGEvOHa0
 RJ+/xPyR9KjoB/SUp++9Vmx/M6SwQYFOJlr3Zpg9LNtR8WpoPYu1E6eO+u1Hhzny
 eJqaNstH+i+VdY9eqszkAsEBh8o9M/+b+7Wx7TetvU+v368CbXtgFYs9qy2oZjPF
 t9sY/BHaHZ8eZ7I00an77a0fVV5B1PVASUtIz5CqkwGpMvX6Z6W2K/XUUFI61kuu
 E06HS6Ht8UPJAzrAPUMl
 =Rq81
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc updates from Michael Ellerman:
 "A bit of a small release, I suspect in part due to me travelling for
  KS. But my backlog of patches to review is smaller than usual, so I
  think in part folks just didn't send as much this cycle.

  Non-highlights:

   - Five fixes for the >128T address space handling, both to fix bugs
     in our implementation and to bring the semantics exactly into line
     with x86.

  Highlights:

   - Support for a new OPAL call on bare metal machines which gives us a
     true NMI (ie. is not masked by MSR[EE]=0) for debugging etc.

   - Support for Power9 DD2 in the CXL driver.

   - Improvements to machine check handling so that uncorrectable errors
     can be reported into the generic memory_failure() machinery.

   - Some fixes and improvements for VPHN, which is used under PowerVM
     to notify the Linux partition of topology changes.

   - Plumbing to enable TM (transactional memory) without suspend on
     some Power9 processors (PPC_FEATURE2_HTM_NO_SUSPEND).

   - Support for emulating vector loads form cache-inhibited memory, on
     some Power9 revisions.

   - Disable the fast-endian switch "syscall" by default (behind a
     CONFIG), we believe it has never had any users.

   - A major rework of the API drivers use when initiating and waiting
     for long running operations performed by OPAL firmware, and changes
     to the powernv_flash driver to use the new API.

   - Several fixes for the handling of FP/VMX/VSX while processes are
     using transactional memory.

   - Optimisations of TLB range flushes when using the radix MMU on
     Power9.

   - Improvements to the VAS facility used to access coprocessors on
     Power9, and related improvements to the way the NX crypto driver
     handles requests.

   - Implementation of PMEM_API and UACCESS_FLUSHCACHE for 64-bit.

  Thanks to: Alexey Kardashevskiy, Alistair Popple, Allen Pais, Andrew
  Donnellan, Aneesh Kumar K.V, Arnd Bergmann, Balbir Singh, Benjamin
  Herrenschmidt, Breno Leitao, Christophe Leroy, Christophe Lombard,
  Cyril Bur, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven,
  Guilherme G. Piccoli, Gustavo Romero, Haren Myneni, Joel Stanley,
  Kamalesh Babulal, Kautuk Consul, Markus Elfring, Masami Hiramatsu,
  Michael Bringmann, Michael Neuling, Michal Suchanek, Naveen N. Rao,
  Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pedro Miraglia
  Franco de Carvalho, Philippe Bergheaud, Sandipan Das, Seth Forshee,
  Shriya, Stephen Rothwell, Stewart Smith, Sukadev Bhattiprolu, Tyrel
  Datwyler, Vaibhav Jain, Vaidyanathan Srinivasan, and William A.
  Kennington III"

* tag 'powerpc-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (151 commits)
  powerpc/64s: Fix Power9 DD2.0 workarounds by adding DD2.1 feature
  powerpc/64s: Fix masking of SRR1 bits on instruction fault
  powerpc/64s: mm_context.addr_limit is only used on hash
  powerpc/64s/radix: Fix 128TB-512TB virtual address boundary case allocation
  powerpc/64s/hash: Allow MAP_FIXED allocations to cross 128TB boundary
  powerpc/64s/hash: Fix fork() with 512TB process address space
  powerpc/64s/hash: Fix 128TB-512TB virtual address boundary case allocation
  powerpc/64s/hash: Fix 512T hint detection to use >= 128T
  powerpc: Fix DABR match on hash based systems
  powerpc/signal: Properly handle return value from uprobe_deny_signal()
  powerpc/fadump: use kstrtoint to handle sysfs store
  powerpc/lib: Implement UACCESS_FLUSHCACHE API
  powerpc/lib: Implement PMEM API
  powerpc/powernv/npu: Don't explicitly flush nmmu tlb
  powerpc/powernv/npu: Use flush_all_mm() instead of flush_tlb_mm()
  powerpc/powernv/idle: Round up latency and residency values
  powerpc/kprobes: refactor kprobe_lookup_name for safer string operations
  powerpc/kprobes: Blacklist emulate_update_regs() from kprobes
  powerpc/kprobes: Do not disable interrupts for optprobes and kprobes_on_ftrace
  powerpc/kprobes: Disable preemption before invoking probe handler for optprobes
  ...
2017-11-16 12:47:46 -08:00
Linus Torvalds e60e1ee606 main drm pull request for v4.15
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJaCm8RAAoJEAx081l5xIa+zX0QAJSm31kCG3vdw2CNiRx25L3q
 3hcsEOgAjVJ9FQVGKFWjzb8TK35tSqtNx5kWIj0VGaIfBE5Bdg5SLLgKKUYas8rY
 4LaphqICq2uxu2BNa2tpiar/sHhAnuozwQ4czpVWXzlaISnb9yYzRl7gMuyUVGkx
 +Gih5VUhLmQC0HsRTLJ3vaZQoUsLAl2gAjKcWa1bx57j2S+iKOPfsLaq7VYo+y1I
 Njc+iSGqMhJzRLXVkxL2lQKaslp7R38Bbh5K4Kvyjkm4Aq7zErOF6irpOXKMcrGl
 mwnr89vf1G9thjikrBaXpKnuvdbWYveoN/ORMlTdCfxkFnChHLnm3bd7NJ49RXDN
 Hv/Iq9YYjmZ9GTatxnx7lWtmXnZXC5he1yn1JAuz/yt7/0b/Wx+Mu/wEpBXYNFTd
 1AZdD586i+AmPo3yDkqH9nBu8JC0W0AnS9VZma4LVvZOP2UfJmj5Im1CLHItbGDN
 FnUCkwyD/lJUUk+WgT+w/GOMJgmFHDiFFl4tFtYVVjrUirpCFVguSKG9xuv6tT8P
 8iRsoP7RrcmDN9ojN2SEHwcpsAv3HnKkDv+9+GIbWnrGsSbCPq8Qm+JDSvf4h22I
 K5lwNpJrcpSKI+q10L7w2xliTBwb98sJkWGA/rssomrdBOWteGZAyqFRYAVgQ+mJ
 x/nJurIqQYh2KQN9+uLG
 =xVV2
 -----END PGP SIGNATURE-----

Merge tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux

Pull drm updates from Dave Airlie:
 "This is the main drm pull request for v4.15.

  Core:
   - Atomic object lifetime fixes
   - Atomic iterator improvements
   - Sparse/smatch fixes
   - Legacy kms ioctls to be interruptible
   - EDID override improvements
   - fb/gem helper cleanups
   - Simple outreachy patches
   - Documentation improvements
   - Fix dma-buf rcu races
   - DRM mode object leasing for improving VR use cases.
   - vgaarb improvements for non-x86 platforms.

  New driver:
   - tve200: Faraday Technology TVE200 block.

     This "TV Encoder" encodes a ITU-T BT.656 stream and can be found in
     the StorLink SL3516 (later Cortina Systems CS3516) as well as the
     Grain Media GM8180.

  New bridges:
   - SiI9234 support

  New panels:
   - S6E63J0X03, OTM8009A, Seiko 43WVF1G, 7" rpi touch panel, Toshiba
     LT089AC19000, Innolux AT043TN24

  i915:
   - Remove Coffeelake from alpha support
   - Cannonlake workarounds
   - Infoframe refactoring for DisplayPort
   - VBT updates
   - DisplayPort vswing/emph/buffer translation refactoring
   - CCS fixes
   - Restore GPU clock boost on missed vblanks
   - Scatter list updates for userptr allocations
   - Gen9+ transition watermarks
   - Display IPC (Isochronous Priority Control)
   - Private PAT management
   - GVT: improved error handling and pci config sanitizing
   - Execlist refactoring
   - Transparent Huge Page support
   - User defined priorities support
   - HuC/GuC firmware refactoring
   - DP MST fixes
   - eDP power sequencing fixes
   - Use RCU instead of stop_machine
   - PSR state tracking support
   - Eviction fixes
   - BDW DP aux channel timeout fixes
   - LSPCON fixes
   - Cannonlake PLL fixes

  amdgpu:
   - Per VM BO support
   - Powerplay cleanups
   - CI powerplay support
   - PASID mgr for kfd
   - SR-IOV fixes
   - initial GPU reset for vega10
   - Prime mmap support
   - TTM updates
   - Clock query interface for Raven
   - Fence to handle ioctl
   - UVD encode ring support on Polaris
   - Transparent huge page DMA support
   - Compute LRU pipe tweaks
   - BO flag to allow buffers to opt out of implicit sync
   - CTX priority setting API
   - VRAM lost infrastructure plumbing

  qxl:
   - fix flicker since atomic rework

  amdkfd:
   - Further improvements from internal AMD tree
   - Usermode events
   - Drop radeon support

  nouveau:
   - Pascal temperature sensor support
   - Improved BAR2 handling
   - MMU rework to support Pascal MMU

  exynos:
   - Improved HDMI/mixer support
   - HDMI audio interface support

  tegra:
   - Prep work for tegra186
   - Cleanup/fixes

  msm:
   - Preemption support for a5xx
   - Display fixes for 8x96 (snapdragon 820)
   - Async cursor plane fixes
   - FW loading rework
   - GPU debugging improvements

  vc4:
   - Prep for DSI panels
   - fix T-format tiling scanout
   - New madvise ioctl

  Rockchip:
   - LVDS support

  omapdrm:
   - omap4 HDMI CEC support

  etnaviv:
   - GPU performance counters groundwork

  sun4i:
   - refactor driver load + TCON backend
   - HDMI improvements
   - A31 support
   - Misc fixes

  udl:
   - Probe/EDID read fixes.

  tilcdc:
   - Misc fixes.

  pl111:
   - Support more variants

  adv7511:
   - Improve EDID handling.
   - HDMI CEC support

  sii8620:
   - Add remote control support"

* tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux: (1480 commits)
  drm/rockchip: analogix_dp: Use mutex rather than spinlock
  drm/mode_object: fix documentation for object lookups.
  drm/i915: Reorder context-close to avoid calling i915_vma_close() under RCU
  drm/i915: Move init_clock_gating() back to where it was
  drm/i915: Prune the reservation shared fence array
  drm/i915: Idle the GPU before shinking everything
  drm/i915: Lock llist_del_first() vs llist_del_all()
  drm/i915: Calculate ironlake intermediate watermarks correctly, v2.
  drm/i915: Disable lazy PPGTT page table optimization for vGPU
  drm/i915/execlists: Remove the priority "optimisation"
  drm/i915: Filter out spurious execlists context-switch interrupts
  drm/amdgpu: use irq-safe lock for kiq->ring_lock
  drm/amdgpu: bypass lru touch for KIQ ring submission
  drm/amdgpu: Potential uninitialized variable in amdgpu_vm_update_directories()
  drm/amdgpu: potential uninitialized variable in amdgpu_vce_ring_parse_cs()
  drm/amd/powerplay: initialize a variable before using it
  drm/amd/powerplay: suppress KASAN out of bounds warning in vega10_populate_all_memory_levels
  drm/amd/amdgpu: fix evicted VRAM bo adjudgement condition
  drm/vblank: Tune drm_crtc_accurate_vblank_count() WARN down to a debug
  drm/rockchip: add CONFIG_OF dependency for lvds
  ...
2017-11-15 20:42:10 -08:00
Linus Torvalds 7c225c69f8 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2 updates

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
  memory hotplug: fix comments when adding section
  mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
  mm: simplify nodemask printing
  mm,oom_reaper: remove pointless kthread_run() error check
  mm/page_ext.c: check if page_ext is not prepared
  writeback: remove unused function parameter
  mm: do not rely on preempt_count in print_vma_addr
  mm, sparse: do not swamp log with huge vmemmap allocation failures
  mm/hmm: remove redundant variable align_end
  mm/list_lru.c: mark expected switch fall-through
  mm/shmem.c: mark expected switch fall-through
  mm/page_alloc.c: broken deferred calculation
  mm: don't warn about allocations which stall for too long
  fs: fuse: account fuse_inode slab memory as reclaimable
  mm, page_alloc: fix potential false positive in __zone_watermark_ok
  mm: mlock: remove lru_add_drain_all()
  mm, sysctl: make NUMA stats configurable
  shmem: convert shmem_init_inodecache() to void
  Unify migrate_pages and move_pages access checks
  mm, pagevec: rename pagevec drained field
  ...
2017-11-15 19:42:40 -08:00
Mel Gorman 2d4894b5d2 mm: remove cold parameter from free_hot_cold_page*
Most callers users of free_hot_cold_page claim the pages being released
are cache hot.  The exception is the page reclaim paths where it is
likely that enough pages will be freed in the near future that the
per-cpu lists are going to be recycled and the cache hotness information
is lost.  As no one really cares about the hotness of pages being
released to the allocator, just ditch the parameter.

The APIs are renamed to indicate that it's no longer about hot/cold
pages.  It should also be less confusing as there are subtle differences
between them.  __free_pages drops a reference and frees a page when the
refcount reaches zero.  free_hot_cold_page handled pages whose refcount
was already zero which is non-obvious from the name.  free_unref_page
should be more obvious.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

[mgorman@techsingularity.net: add pages to head, not tail]
  Link: http://lkml.kernel.org/r/20171019154321.qtpzaeftoyyw4iey@techsingularity.net
Link: http://lkml.kernel.org/r/20171018075952.10627-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Levin, Alexander (Sasha Levin) 75f296d93b kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK
Convert all allocations that used a NOTRACK flag to stop using it.

Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Kirill A. Shutemov b4e98d9ac7 mm: account pud page tables
On a machine with 5-level paging support a process can allocate
significant amount of memory and stay unnoticed by oom-killer and memory
cgroup.  The trick is to allocate a lot of PUD page tables.  We don't
account PUD page tables, only PMD and PTE.

We already addressed the same issue for PMD page tables, see commit
dc6c9a35b6 ("mm: account pmd page tables to the process").
Introduction of 5-level paging brings the same issue for PUD page
tables.

The patch expands accounting to PUD level.

[kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
  Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
[heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
  Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00