UAMOR values are not application-specific. The kernel initializes
its value based on different reserved keys. Remove the thread-specific
UAMOR value and don't switch the UAMOR on context switch.
Move UAMOR initialization to key initialization code and remove
thread_struct.uamor because it is not used anymore.
Before commit: 4a4a5e5d2a ("powerpc/pkeys: key allocation/deallocation must not change pkey registers")
we used to update uamor based on key allocation and free.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-20-aneesh.kumar@linux.ibm.com
As we kexec across kernels that use AMR/IAMR for different purposes
we need to ensure that new kernels get kexec'd with a reset value
of AMR/IAMR. For ex: the new kernel can use key 0 for kernel mapping and the old
AMR value prevents access to key 0.
This patch also removes reset if IAMR and AMOR in kexec_sequence. Reset of AMOR
is not needed and the IAMR reset is partial (it doesn't do the reset
on secondary cpus) and is redundant with this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-19-aneesh.kumar@linux.ibm.com
Parse storage keys related device tree entry in early_init_devtree
and enable MMU feature MMU_FTR_PKEY if pkeys are supported.
MMU feature is used instead of CPU feature because this enables us
to group MMU_FTR_KUAP and MMU_FTR_PKEY in asm feature fixup code.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709032946.881753-14-aneesh.kumar@linux.ibm.com
Introduce notification chain which lets us know about uncorrected memory
errors(UE). This would help prospective users in pmem or nvdimm subsystem
to track bad blocks for better handling of persistent memory allocations.
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200709135142.721504-1-santosh@fossix.org
When '029ab30b4c0a ("powerpc/mm: Enable radix GTSE only if supported.")'
made GTSE an MMU feature, it was enabled by default in
powerpc-cpu-features but was missed in pa-features. This causes random
memory corruption during boot of PowerNV kernels where
CONFIG_PPC_DT_CPU_FTRS isn't enabled.
Fixes: 029ab30b4c ("powerpc/mm: Enable radix GTSE only if supported.")
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
[mpe: Unwrap long line]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200720044258.863574-1-bharata@linux.ibm.com
The device-tree properties to check secure and trusted boot state are
different for guests (pseries) compared to baremetal (powernv).
This patch updates the existing is_ppc_secureboot_enabled() and
is_ppc_trustedboot_enabled() functions to add support for pseries.
For pseries the secureboot and trustedboot state are exposed via
device-tree properties /ibm,secure-boot and /ibm,trusted-boot.
The values of ibm,secure-boot under pseries are interpreted as:
0 - Disabled
1 - Enabled in Log-only mode. This patch interprets this value as
disabled, since audit mode is currently not supported for
Linux.
2 - Enabled and enforced.
3-9 - Enabled and enforcing; requirements are at the discretion of
the operating system.
The values of ibm,trusted-boot under pseries are interpreted as:
0 - Disabled
1 - Enabled
Signed-off-by: Nayna Jain <nayna@linux.ibm.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>
[mpe: Drop machdep.h inclusion, tweak change log slightly]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1594813921-12425-1-git-send-email-nayna@linux.ibm.com
The code in vdso_cpu_init that exposes the cpu and numa node to
userspace via SPRG_VDSO incorrctly masks the cpu to 12 bits. This means
that any kernel running on a box with more than 4096 threads (NR_CPUS
advertises a limit of of 8192 cpus) would expose userspace to two cpu
contexts running at the same time with the same cpu number.
Note: I'm not aware of any distro shipping a kernel with support for more
than 4096 threads today, nor of any system image that currently exceeds
4096 threads. Found via code browsing.
Fixes: 18ad51dd34 ("powerpc: Add VDSO version of getcpu")
Signed-off-by: Milton Miller <miltonm@us.ibm.com>
Signed-off-by: Anton Blanchard <anton@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200715233704.1352257-1-anton@ozlabs.org
When we enter into fadump crash path via system reset we fail to update
the pstore.
On the system reset path we first update the pstore then we go for fadump
crash. But the problem here is when all the CPUs try to get the pstore
lock to initiate the pstore write, only one CPUs will acquire the lock
and proceed with the pstore write. Since it in NMI context CPUs that fail
to get lock do not wait for their turn to write to the pstore and simply
proceed with the next operation which is fadump crash. One of the CPU who
proceeded with fadump crash path triggers the crash and does not wait for
the CPU who gets the pstore lock to complete the pstore update.
Timeline diagram to depicts the sequence of events that leads to an
unsuccessful pstore update when we hit fadump crash path via system reset.
1 2 3 ... n CPU Threads
| | | |
| | | |
Reached to -->|--->|---->| ----------->|
system reset | | | |
path | | | |
| | | |
Try to -->|--->|---->|------------>|
acquire the | | | |
pstore lock | | | |
| | | |
| | | |
Got the -->| +->| | |<-+
pstore lock | | | | | |--> Didn't get the
| --------------------------+ lock and moving
| | | | ahead on fadump
| | | | crash path
| | | |
Begins the -->| | | |
process to | | | |<-- Got the chance to
update the | | | | trigger the crash
pstore | -> | | ... <- |
| | | | | |
| | | | | |<-- Triggers the
| | | | | | crash
| | | | | | ^
| | | | | | |
Writing to -->| | | | | | |
pstore | | | | | | |
| | |
^ |__________________| |
| CPU Relax |
| |
+-----------------------------------------+
|
v
Race: crash triggered before pstore
update completes
To avoid this race condition a barrier is added on crash_fadump path, it
prevents the CPU to trigger the crash until all the online CPUs completes
their task.
A barrier is added to make sure all the secondary CPUs hit the
crash_fadump function before we initiates the crash. A timeout is kept to
ensure the primary CPU (one who initiates the crash) do not wait for
secondary CPUs indefinitely.
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200713052435.183750-1-sourabhjain@linux.ibm.com
prrn_is_enabled() always returns false/0, so handle_rtas_event() can
be simplified and some dead code can be removed. Use machine_is()
instead of #ifdef to run this code only on pseries, and add an
informational ratelimited message that we are ignoring the
events. PRRN events are relatively rare in normal operation and
usually arise from operator-initiated actions such as a DPO (Dynamic
Platform Optimizer) run.
Eventually we do want to consume these events and update the device
tree, but that needs more care to be safe vs LPM and DLPAR.
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-13-nathanl@linux.ibm.com
Partition suspension, used for hibernation and migration, requires
that the OS place all but one of the LPAR's processor threads into one
of two states prior to calling the ibm,suspend-me RTAS function:
* the architected offline state (via RTAS stop-self); or
* the H_JOIN hcall, which does not return until the partition
resumes execution
Using H_CEDE as the offline mode, introduced by
commit 3aa565f53c ("powerpc/pseries: Add hooks to put the CPU into
an appropriate offline state"), means that any threads which are
offline from Linux's point of view must be moved to one of those two
states before a partition suspension can proceed.
This was eventually addressed in commit 120496ac2d ("powerpc: Bring
all threads online prior to migration/hibernation"), which added code
to temporarily bring up any offline processor threads so they can call
H_JOIN. Conceptually this is fine, but the implementation has had
multiple races with cpu hotplug operations initiated from user
space[1][2][3], the error handling is fragile, and it generates
user-visible cpu hotplug events which is a lot of noise for a platform
feature that's supposed to minimize disruption to workloads.
With commit 3aa565f53c ("powerpc/pseries: Add hooks to put the CPU
into an appropriate offline state") reverted, this code becomes
unnecessary, so remove it. Since any offline CPUs now are truly
offline from the platform's point of view, it is no longer necessary
to bring up CPUs only to have them call H_JOIN and then go offline
again upon resuming. Only active threads are required to call H_JOIN;
stopped threads can be left alone.
[1] commit a6717c01dd ("powerpc/rtas: use device model APIs and
serialization during LPM")
[2] commit 9fb603050f ("powerpc/rtas: retry when cpu offline races
with suspend/migration")
[3] commit dfd718a2ed ("powerpc/rtas: Fix a potential race between
CPU-Offline & Migration")
Fixes: 120496ac2d ("powerpc: Bring all threads online prior to migration/hibernation")
Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200612051238.1007764-3-nathanl@linux.ibm.com
If both count cache and link stack are to be flushed, and can be flushed
with the special bcctr, patch that in directly to the flush/branch nop
site.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200609070610.846703-7-npiggin@gmail.com
Branch cache flushing code patching has inter-dependencies on both the
link stack and the count cache flushing state.
To make the code clearer and to separate the link stack and count
cache handling, split the "toggle" (setting up variables and printing
enable/disable) from the code patching.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Always print something, even if the flush is disabled]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200609070610.846703-5-npiggin@gmail.com
Prepare to allow for hardware link stack flushing by using the
none/sw/hw type, same as the count cache state.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200609070610.846703-3-npiggin@gmail.com
The count cache flush mostly refers to both count cache and link stack
flushing. As a first step to untangling these a bit, re-name the bits
that apply to both.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200609070610.846703-2-npiggin@gmail.com
When a FP/VEC/VSX unavailable fault loads registers and enables the
facility in the MSR, re-set the lazy restore counters to 1 rather
than incrementing them so every fault gets the same number of
restores before the next fault.
This probably shouldn't be a practical change because if a lazy counter
was non-zero then it should have been restored and would not cause a
fault when userspace tries to access it. However the code and comment
implies otherwise so that's misleading and unnecessary.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200623234139.2262227-3-npiggin@gmail.com
Before returning to user, if there are missing FP/VEC/VSX bits from the
user MSR then those registers had been saved and must be restored again
before use. restore_math will decide whether to restore immediately, or
skip the restore and let fp/vec/vsx unavailable faults demand load the
registers.
Each time restore_math restores one of the FP/VSX or VEC register sets
is loaded, an 8-bit counter is incremented (load_fp and load_vec). When
these wrap to zero, restore_math no longer restores that register set
until after they are next demand faulted.
It's quite usual for those counters to have different values, so if one
wraps to zero and restore_math no longer restores its registers or user
MSR bit but the other is not zero yet does not need to be restored
(because the kernel is not frequently using the FPU), then restore_math
will be called and it will also not return in the early exit check.
This causes msr_check_and_set to test and set the MSR at every kernel
exit despite having no work to do.
This can cause workloads (e.g., a NULL syscall microbenchmark) to run
fast for a time while both counters are non-zero, then slow down when
one of the counters reaches zero, then speed up again after the second
counter reaches zero. The cost is significant, about 10% slowdown on a
NULL syscall benchmark, and the jittery behaviour is very undesirable.
Fix this by having restore_math test all conditions first, and only
update MSR if we will be loading registers.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200623234139.2262227-2-npiggin@gmail.com
The TM test in restore_math added by commit dc16b553c9 ("powerpc:
Always restore FPU/VEC/VSX if hardware transactional memory in use") is
no longer necessary after commit a8318c13e7 ("powerpc/tm: Fix
restoring FP/VMX facility incorrectly on interrupts"), which removed
the cases where restore_math has to restore if TM is active.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200623234139.2262227-1-npiggin@gmail.com
Make GTSE an MMU feature and enable it by default for radix.
However for guest, conditionally enable it if hypervisor supports
it via OV5 vector. Let prom_init ask for radix GTSE only if the
support exists.
Having GTSE as an MMU feature will make it easy to enable radix
without GTSE. Currently radix assumes GTSE is enabled by default.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200703053608.12884-2-bharata@linux.ibm.com
save_general_regs() which does special handling when i == PT_SOFTE.
Rewrite it to minimise the specific part, especially the __put_user()
and associated error handling is the same so make it common.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
[mpe: Use a regular if rather than ternary operator]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/47a38df46cae5a5a88a558a64d71f75e9c4d9950.1594125164.git.christophe.leroy@csgroup.eu
Since commit ("1bd79336a426 powerpc: Fix various
syscall/signal/swapcontext bugs"), getting save_general_regs() called
without FULL_REGS() is very unlikely and generates a warning.
The 32-bit version of save_general_regs() doesn't take care of it
at all and copies all registers anyway since that commit.
Moreover, commit 965dd3ad30 ("powerpc/64/syscall: Remove
non-volatile GPR save optimisation") is another reason why it would
never happen.
So the same with 64-bit, don't worry about FULL_REGS() and copy
all registers all the time.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/173de3b659fa3a5f126a0eb170522cccd909950f.1594125164.git.christophe.leroy@csgroup.eu
Returning from an interrupt or syscall to a signal handler currently
begins execution directly at the handler's entry point, with LR set to
the address of the sigreturn trampoline. When the signal handler
function returns, it runs the trampoline. It looks like this:
# interrupt at user address xyz
# kernel stuff... signal is raised
rfid
# void handler(int sig)
addis 2,12,.TOC.-.LCF0@ha
addi 2,2,.TOC.-.LCF0@l
mflr 0
std 0,16(1)
stdu 1,-96(1)
# handler stuff
ld 0,16(1)
mtlr 0
blr
# __kernel_sigtramp_rt64
addi r1,r1,__SIGNAL_FRAMESIZE
li r0,__NR_rt_sigreturn
sc
# kernel executes rt_sigreturn
rfid
# back to user address xyz
Note the blr with no matching bl. This can corrupt the return
predictor.
Solve this by instead resuming execution at the signal trampoline
which then calls the signal handler. qtrace-tools link_stack checker
confirms the entire user/kernel/vdso cycle is balanced after this
patch, whereas it's not upstream.
Alan confirms the dwarf unwind info still looks good. gdb still
recognises the signal frame and can step into parent frames if it
break inside a signal handler.
Performance is pretty noisy, not a very significant change on a POWER9
here, but branch misses are consistently a lot lower on a
microbenchmark:
Performance counter stats for './signal':
13,085.72 msec task-clock # 1.000 CPUs utilized
45,024,760,101 cycles # 3.441 GHz
65,102,895,542 instructions # 1.45 insn per cycle
11,271,673,787 branches # 861.372 M/sec
59,468,979 branch-misses # 0.53% of all branches
12,989.09 msec task-clock # 1.000 CPUs utilized
44,692,719,559 cycles # 3.441 GHz
65,109,984,964 instructions # 1.46 insn per cycle
11,282,136,057 branches # 868.585 M/sec
39,786,942 branch-misses # 0.35% of all branches
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200511101952.1463138-1-npiggin@gmail.com
Unlike drivers/base/cacheinfo, powerpc cacheinfo code is not exposing
shared_cpu_list under /sys/devices/system/cpu/cpu<n>/cache/index<m>
Add shared_cpu_list to per cpu per index directory to maintain parity
with x86. Some scripts (example: mmtests
https://github.com/gormanm/mmtests) seem to be looking for
shared_cpu_list instead of shared_cpu_map.
Before this patch:
# ls /sys/devices/system/cpu0/cache/index1
coherency_line_size number_of_sets size ways_of_associativity
level shared_cpu_map type
# cat /sys/devices/system/cpu0/cache/index1/shared_cpu_map
00ff
#
After this patch:
# ls /sys/devices/system/cpu0/cache/index1
coherency_line_size number_of_sets shared_cpu_map type
level shared_cpu_list size ways_of_associativity
# cat /sys/devices/system/cpu0/cache/index1/shared_cpu_map
00ff
# cat /sys/devices/system/cpu0/cache/index1/shared_cpu_list
0-7
#
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200629103703.4538-4-srikar@linux.vnet.ibm.com
In anticipation of implementing shared_cpu_list, move code under
shared_cpu_map_show() to a common function.
No functional changes.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200629103703.4538-3-srikar@linux.vnet.ibm.com
Tejun Heo had modified shared_cpu_map_show() to use scnprintf instead
of cpumap_print during support for *pb[l] format. Refer commit
0c118b7bd0 ("powerpc: use %*pb[l] to print bitmaps including
cpumasks and nodemasks").
cpumap_print_to_pagebuf() is a standard function to print cpumap. With
commit 9cf79d115f ("bitmap: remove explicit newline handling using
scnprintf format string"), there is no need to print explicit newline
and trailing null character. cpumap_print_to_pagebuf() internally uses
scnprintf(). Hence replace scnprintf() with cpumap_print_to_pagebuf().
Note: shared_cpu_map_show() in drivers/base/cacheinfo.c already uses
cpumap_print_to_pagebuf().
Before this patch:
# cat /sys/devices/system/cpu0/cache/index1/shared_cpu_map
00ff
#
(Notice the extra blank line).
After this patch:
# cat /sys/devices/system/cpu0/cache/index1/shared_cpu_map
00ff
#
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200629103703.4538-2-srikar@linux.vnet.ibm.com
Early secure guest boot hits the below crash while booting with
vcpus numbers aligned with page boundary for PAGE size of 64k
and LPPACA size of 1k i.e 64, 128 etc.
Partition configured for 64 cpus.
CPU maps initialized for 1 thread per core
------------[ cut here ]------------
kernel BUG at arch/powerpc/kernel/paca.c:89!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
This is due to the BUG_ON() for shared_lppaca_total_size equal to
shared_lppaca_size. Instead the code should only BUG_ON() if we have
exceeded the total_size, which indicates we've overflowed the array.
Fixes: bd104e6db6 ("powerpc/pseries/svm: Use shared memory for LPPACA structures")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
[mpe: Reword change log to clarify we're fixing not removing the check]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200619070113.16696-1-sathnaga@linux.vnet.ibm.com
A typo caused the interrupt handler to branch immediately to the
common "unknown interrupt" handler and skip the special case test for
denormal cause.
This does not affect KVM softpatch handling (e.g., for POWER9 TM
assist) because the KVM test was moved to common code by commit
9600f261ac ("powerpc/64s/exception: Move KVM test to common code")
just before this bug was introduced.
Fixes: 3f7fbd97d0 ("powerpc/64s/exception: Clean up SRR specifiers")
Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
[mpe: Split selftest into a separate patch]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200708074942.1713396-1-npiggin@gmail.com
blrl is not recommended to use as an indirect function call, as it may
corrupt the link stack predictor.
This is not a performance critical path but this should be fixed for
consistency.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200611121119.1015740-1-npiggin@gmail.com
One fix for the interrupt rework we did last release which broke KVM-PR.
Three commits fixing some fallout from the READ_ONCE() changes interacting badly
with our 8xx 16K pages support, which uses a pte_t that is a structure of 4
actual PTEs.
A cleanup of the 8xx pte_update() to use the newly added pmd_off().
A fix for a crash when handling an oops if CONFIG_DEBUG_VIRTUAL is enabled.
A minor fix for the SPU syscall generation.
Thanks to:
Aneesh Kumar K.V, Christian Zigotzky, Christophe Leroy, Mike Rapoport,
Nicholas Piggin.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl7vNVsTHG1wZUBlbGxl
cm1hbi5pZC5hdQAKCRBR6+o8yOGlgO3rD/46cXJQ9AMQqtZh3+sgWu95Zd6JOviL
vfhWeH/kbt/p6OGPoLXoYChoFD44Mf7BmTEDflslYICrxvhu9zI2lYN+948zfrEY
lIjjP+Dd6fr1D2o3+hnOOX/LHAVyyZJTsZp5i6ehTOXeUw8KOCF1ulVB3o5GgQK0
I/0oewL/SXNFnZS5qLgF2/OFS/BH3OnDG6mpICxCetZC9mNbHrTzos403ijyrvcX
AsE4JSzI2UM9kT0pWXLa9QR3RgfBZ4wtMrnKAwdGI/E+YqAa7TuHZatPDAqoCJYY
aePEZdweaeLWHQaQYSqlNP7YLAHuSdvZ2SvU65c2EKaaXug9sZJImyboJl/fo0Xo
EtZiVbfaTfqsyi7EVQnsLMFYmtquacXoUH//nIoTro4pRkeMsM94BiK2HISa+8Bs
KGQxBsnK2UaTgWERZHiK2VaKY/Tl1vGs09u7R21GiE2aD25ly+/q1Uo+WUr6iRKh
1v42AsH1VCeEZKAog43gBGOr7bCez8/90GNtTJnKTKndSRSybCH68ME/zBKdNACn
A7M9E0CNNjTOQNJyQ2UhyiBJzUK6kT/5g+C4mEH5WG4FkO6YHT1JyEusvsfj6Oe9
RwDr98iNuM8AhaT30XmUXithDAl6JA5+3S0OcC2bL2xQ0O/VBPGZhIzgSFU8T7BY
qpDj8l/8zk64Fw==
=qws5
-----END PGP SIGNATURE-----
Merge tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- One fix for the interrupt rework we did last release which broke
KVM-PR
- Three commits fixing some fallout from the READ_ONCE() changes
interacting badly with our 8xx 16K pages support, which uses a pte_t
that is a structure of 4 actual PTEs
- A cleanup of the 8xx pte_update() to use the newly added pmd_off()
- A fix for a crash when handling an oops if CONFIG_DEBUG_VIRTUAL is
enabled
- A minor fix for the SPU syscall generation
Thanks to Aneesh Kumar K.V, Christian Zigotzky, Christophe Leroy, Mike
Rapoport, Nicholas Piggin.
* tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/8xx: Provide ptep_get() with 16k pages
mm: Allow arches to provide ptep_get()
mm/gup: Use huge_ptep_get() in gup_hugepte()
powerpc/syscalls: Use the number when building SPU syscall table
powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()
powerpc/64s: Fix KVM interrupt using wrong save area
powerpc: Fix kernel crash in show_instructions() w/DEBUG_VIRTUAL
Now that we've renamed probe_kernel_address() to get_kernel_nofault()
and made it look and behave more in line with get_user(), some of the
subtle type behavior differences end up being more obvious and possibly
dangerous.
When you do
get_user(val, user_ptr);
the type of the access comes from the "user_ptr" part, and the above
basically acts as
val = *user_ptr;
by design (except, of course, for the fact that the actual dereference
is done with a user access).
Note how in the above case, the type of the end result comes from the
pointer argument, and then the value is cast to the type of 'val' as
part of the assignment.
So the type of the pointer is ultimately the more important type both
for the access itself.
But 'get_kernel_nofault()' may now _look_ similar, but it behaves very
differently. When you do
get_kernel_nofault(val, kernel_ptr);
it behaves like
val = *(typeof(val) *)kernel_ptr;
except, of course, for the fact that the actual dereference is done with
exception handling so that a faulting access is suppressed and returned
as the error code.
But note how different the casting behavior of the two superficially
similar accesses are: one does the actual access in the size of the type
the pointer points to, while the other does the access in the size of
the target, and ignores the pointer type entirely.
Actually changing get_kernel_nofault() to act like get_user() is almost
certainly the right thing to do eventually, but in the meantime this
patch adds logit to at least verify that the pointer type is compatible
with the type of the result.
In many cases, this involves just casting the pointer to 'void *' to
make it obvious that the type of the pointer is not the important part.
It's not how 'get_user()' acts, but at least the behavioral difference
is now obvious and explicit.
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Better describe what this helper does, and match the naming of
copy_from_kernel_nofault.
Also switch the argument order around, so that it acts and looks
like get_user().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The CTR register reload in the KVM interrupt path used the wrong save
area for SLB (and NMI) interrupts.
Fixes: 9600f261ac ("powerpc/64s/exception: Move KVM test to common code")
Cc: stable@vger.kernel.org # v5.7+
Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200615061247.1310763-1-npiggin@gmail.com
With CONFIG_DEBUG_VIRTUAL=y, we can hit a BUG() if we take a hard
lockup watchdog interrupt when in OPAL mode.
This happens in show_instructions() if the kernel takes the watchdog
NMI IPI, or any other interrupt, with MSR_IR == 0. show_instructions()
updates the variable pc in the loop and the second iteration will
result in BUG().
We hit the BUG_ON due the below check in __va()
#define __va(x)
({
VIRTUAL_BUG_ON((unsigned long)(x) >= PAGE_OFFSET);
(void *)(unsigned long)((phys_addr_t)(x) | PAGE_OFFSET);
})
Fix it by moving the check out of the loop. Also update nip so that
the nip == pc check still matches.
Fixes: 4dd7554a64 ("powerpc/64: Add VIRTUAL_BUG_ON checks for __va and __pa addresses")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Use IS_ENABLED(), massage change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200524093822.423487-1-aneesh.kumar@linux.ibm.com
The replacement of <asm/pgrable.h> with <linux/pgtable.h> made the include
of the latter in the middle of asm includes. Fix this up with the aid of
the below script and manual adjustments here and there.
import sys
import re
if len(sys.argv) is not 3:
print "USAGE: %s <file> <header>" % (sys.argv[0])
sys.exit(1)
hdr_to_move="#include <linux/%s>" % sys.argv[2]
moved = False
in_hdrs = False
with open(sys.argv[1], "r") as f:
lines = f.readlines()
for _line in lines:
line = _line.rstrip('
')
if line == hdr_to_move:
continue
if line.startswith("#include <linux/"):
in_hdrs = True
elif not moved and in_hdrs:
moved = True
print hdr_to_move
print line
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The include/linux/pgtable.h is going to be the home of generic page table
manipulation functions.
Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
make the latter include asm/pgtable.h.
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: consolidate definitions of page table accessors", v2.
The low level page table accessors (pXY_index(), pXY_offset()) are
duplicated across all architectures and sometimes more than once. For
instance, we have 31 definition of pgd_offset() for 25 supported
architectures.
Most of these definitions are actually identical and typically it boils
down to, e.g.
static inline unsigned long pmd_index(unsigned long address)
{
return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
}
static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
}
These definitions can be shared among 90% of the arches provided
XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
For architectures that really need a custom version there is always
possibility to override the generic version with the usual ifdefs magic.
These patches introduce include/linux/pgtable.h that replaces
include/asm-generic/pgtable.h and add the definitions of the page table
accessors to the new header.
This patch (of 12):
The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
functions involving page table manipulations, e.g. pte_alloc() and
pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
in the files that include <linux/mm.h>.
The include statements in such cases are remove with a simple loop:
for f in $(git grep -l "include <linux/mm.h>") ; do
sed -i -e '/include <asm\/pgtable.h>/ d' $f
done
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: Guan Xuetao <gxt@pku.edu.cn>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now the last users of show_stack() got converted to use an explicit log
level, show_stack_loglvl() can drop it's redundant suffix and become once
again well known show_stack().
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-51-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the log-level of show_stack() depends on a platform
realization. It creates situations where the headers are printed with
lower log level or higher than the stacktrace (depending on a platform or
user).
Furthermore, it forces the logic decision from user to an architecture
side. In result, some users as sysrq/kdb/etc are doing tricks with
temporary rising console_loglevel while printing their messages. And in
result it not only may print unwanted messages from other CPUs, but also
omit printing at all in the unlucky case where the printk() was deferred.
Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier
approach than introducing more printk buffers. Also, it will consolidate
printings with headers.
Introduce show_stack_loglvl(), that eventually will substitute
show_stack().
[1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/T/#u
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Link: http://lkml.kernel.org/r/20200418201944.482088-27-dima@arista.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>