2016-04-29 21:26:05 +08:00
/*
 * TLB flush routines for radix kernels.
 *
 * Copyright 2015-2016, Aneesh Kumar K.V, IBM Corporation.
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */

#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/memblock.h>

powerpc/mm/radix: Workaround prefetch issue with KVM

There's a somewhat architectural issue with Radix MMU and KVM.

When coming out of a guest with AIL (Alternate Interrupt Location, i.e.,
MMU enabled), we start executing hypervisor code with the PID register
still containing whatever the guest has been using.

The problem is that the CPU can (and will) then start prefetching or
speculatively loading from whatever host context has that same PID (if
any), thus bringing translations for that context into the TLB, which
Linux doesn't know about.

This can cause stale translations and subsequent crashes.

Fixing this in a way that is neither racy nor a huge performance
impact is difficult. We could just make the host invalidations always
use broadcast forms, but that would hurt single-threaded programs, for
example.

We chose to fix it instead by partitioning the PID space between guest
and host. This is possible because today Linux only uses 19 out of the
20 bits of PID space, so existing guests will work if we make the host
use the top half of the 20-bit space.

We additionally add support for a property to indicate to Linux the
size of the PID register, which will be useful if we eventually have
processors with a larger PID space available.

There is still an issue with malicious guests purposefully setting the
PID register to a value in the host's PID range. Hopefully future HW
can prevent that, but in the meantime, we handle it with a pair of
kludges:

 - On the way out of a guest, before we clear the current VCPU in the
   PACA, we check the PID and if it's outside of the permitted range
   we flush the TLB for that PID.

 - When context switching, if the mm is "new" on that CPU (the
   corresponding bit was set for the first time in the mm cpumask), we
   check if any sibling thread is in KVM (has a non-NULL VCPU pointer
   in the PACA). If that is the case, we also flush the PID for that
   CPU (core).

This second part is needed to handle the case where a process is
migrated (or starts a new pthread) on a sibling thread of the CPU
coming out of KVM, as there's a window where stale translations can
exist before we detect it and flush them out.

A future optimization could be added by keeping track of whether the
PID has ever been used and avoiding the flush for completely fresh PIDs.
We could similarly mark PIDs that have been the subject of a global
invalidation as "fresh". But for now this will do.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
 unneeded include of kvm_book3s_asm.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-07-24 12:26:06 +08:00
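The PID-space split described in the commit message above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions: the helper names `host_pid_min` and `pid_in_guest_range` and the `pid_bits` parameter are hypothetical and are not the kernel's identifiers; the 20-bit width is the figure quoted in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: lowest PID in the host's (top) half of the PID space. */
static uint32_t host_pid_min(unsigned int pid_bits)
{
	assert(pid_bits >= 1 && pid_bits <= 31);
	return UINT32_C(1) << (pid_bits - 1);
}

/* Hypothetical helper: true if the PID lies in the lower (guest) half. */
static bool pid_in_guest_range(uint32_t pid, unsigned int pid_bits)
{
	return pid < host_pid_min(pid_bits);
}
```

With a 20-bit PID register, `host_pid_min(20)` is 0x80000. A guest that exits with a PID at or above that value is the "outside of the permitted range" case that triggers the extra flush on the way out of the guest.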
#include <asm/ppc-opcode.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
#include <asm/trace.h>
#include <asm/cputhreads.h>

#define RIC_FLUSH_TLB 0
#define RIC_FLUSH_PWC 1
#define RIC_FLUSH_ALL 2

powerpc/64s: Improve local TLB flush for boot and MCE on POWER9

There are several cases outside the normal address space management
where a CPU's entire local TLB is to be flushed:

  1. Booting the kernel, in case something has left stale entries in
     the TLB (e.g., kexec).

  2. Machine check, to clean corrupted TLB entries.

One other place where the TLB is flushed is waking from deep idle
states. The flush is a side-effect of calling ->cpu_restore with the
intention of re-setting various SPRs. The flush itself is unnecessary
because in the first case, the TLB should not acquire new corrupted
TLB entries as part of sleep/wake (though they may be lost).

This type of TLB flush is coded inflexibly, several times for each CPU
type, and they have a number of problems with ISA v3.0B:

- The current radix mode of the MMU is not taken into account; it is
  always done as a hash flush. For IS=2 (LPID-matching flush from host)
  and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if
  the R field does not match the current radix mode.

- ISA v3.0B hash must flush the partition and process table caches as
  well.

- ISA v3.0B radix must flush partition and process scoped translations,
  partition and process table caches, and also the page walk cache.

So consolidate the flushing code and implement it in C and inline asm
under the mm/ directory with the rest of the flush code. Add ISA v3.0B
cases for radix and hash, and use the radix flush in radix environment.

Provide a way for IS=2 (LPID flush) to specify the radix mode of the
partition. Have KVM pass in the radix mode of the guest.

Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
and move them later in the boot process, after the MMU registers are set
up and before relocation is first turned on.

The TLB flush is no longer called when restoring from deep idle states.
This was not done as a separate step because booting secondaries
uses the same cpu_restore as idle restore, which needs the TLB flush.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-12-23 23:15:50 +08:00
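The set-based invalidation helpers in this file build the RB and RS operands with big-endian (IBM) bit numbering. The following stand-alone sketch mirrors those shifts so the resulting values can be inspected outside the kernel. It assumes a 64-bit register; `BITLSHIFT`, `radix_set_rb` and `radix_rs` are hypothetical stand-ins, not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 64-bit word. */
#define BITLSHIFT(be)	(63 - (be))

/* RB operand: the set number is shifted to end at bit 51, IS at bit 53. */
static uint64_t radix_set_rb(unsigned int set, unsigned int is)
{
	assert(is <= 3);
	return ((uint64_t)set << BITLSHIFT(51)) |
	       ((uint64_t)is << BITLSHIFT(53));
}

/* RS operand: the PID is shifted to end at bit 31 (the upper 32 bits). */
static uint64_t radix_rs(uint32_t pid)
{
	return (uint64_t)pid << BITLSHIFT(31);
}
```

For example, `radix_set_rb(0, 3)` yields 0xc00 (IS = 3, set 0), and `radix_rs(1)` places PID 1 at bit position 32 counted from the least significant end.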
/*
 * tlbiel instruction for radix, set invalidation
 * i.e., r=1 and is=01 or is=10 or is=11
 */
static inline void tlbiel_radix_set_isa300(unsigned int set, unsigned int is,
					unsigned int pid,
					unsigned int ric, unsigned int prs)
{
	unsigned long rb;
	unsigned long rs;
	unsigned int r = 1; /* radix format */

	rb = (set << PPC_BITLSHIFT(51)) | (is << PPC_BITLSHIFT(53));
	rs = ((unsigned long)pid << PPC_BITLSHIFT(31));

	asm volatile(PPC_TLBIEL(%0, %1, %2, %3, %4)
		     : : "r"(rb), "r"(rs), "i"(ric), "i"(prs), "r"(r)
		     : "memory");
}

static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
{
	unsigned int set;

	asm volatile("ptesync": : :"memory");

	/*
	 * Flush the first set of the TLB, and the entire Page Walk Cache
	 * and partition table entries. Then flush the remaining sets of the
	 * TLB.
	 */
	tlbiel_radix_set_isa300(0, is, 0, RIC_FLUSH_ALL, 0);
	for (set = 1; set < num_sets; set++)
		tlbiel_radix_set_isa300(set, is, 0, RIC_FLUSH_TLB, 0);

	/* Do the same for process scoped entries. */
	tlbiel_radix_set_isa300(0, is, 0, RIC_FLUSH_ALL, 1);
	for (set = 1; set < num_sets; set++)
		tlbiel_radix_set_isa300(set, is, 0, RIC_FLUSH_TLB, 1);

	asm volatile("ptesync": : :"memory");
}

void radix__tlbiel_all(unsigned int action)
{
	unsigned int is;

	switch (action) {
	case TLB_INVAL_SCOPE_GLOBAL:
		is = 3;
		break;
	case TLB_INVAL_SCOPE_LPID:
		is = 2;
		break;
	default:
		BUG();
	}

	if (early_cpu_has_feature(CPU_FTR_ARCH_300))
		tlbiel_all_isa300(POWER9_TLB_SETS_RADIX, is);
	else
		WARN(1, "%s called on pre-POWER9 CPU\n", __func__);

	asm volatile(PPC_INVALIDATE_ERAT "; isync" : : :"memory");
}

static inline void __tlbiel_pid(unsigned long pid, int set,
				unsigned long ric)
{
	unsigned long rb, rs, prs, r;

	rb = PPC_BIT(53); /* IS = 1 */
	rb |= set << PPC_BITLSHIFT(51);
	rs = ((unsigned long)pid) << PPC_BITLSHIFT(31);
	prs = 1; /* process scoped */
	r = 1;   /* radix format */

	asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
		     : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
	trace_tlbie(0, 1, rb, rs, ric, prs, r);
}

static inline void __tlbie_pid(unsigned long pid, unsigned long ric)
{
	unsigned long rb, rs, prs, r;

	rb = PPC_BIT(53); /* IS = 1 */
	rs = pid << PPC_BITLSHIFT(31);
	prs = 1; /* process scoped */
	r = 1;   /* radix format */

	asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
		     : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
	trace_tlbie(0, 0, rb, rs, ric, prs, r);
}

static inline void __tlbiel_va(unsigned long va, unsigned long pid,
			       unsigned long ap, unsigned long ric)
{
	unsigned long rb, rs, prs, r;

	rb = va & ~(PPC_BITMASK(52, 63));
	rb |= ap << PPC_BITLSHIFT(58);
	rs = pid << PPC_BITLSHIFT(31);
	prs = 1; /* process scoped */
	r = 1;   /* radix format */

	asm volatile(PPC_TLBIEL(%0, %4, %3, %2, %1)
		     : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
	trace_tlbie(0, 1, rb, rs, ric, prs, r);
}

static inline void __tlbie_va(unsigned long va, unsigned long pid,
			      unsigned long ap, unsigned long ric)
{
	unsigned long rb, rs, prs, r;

	rb = va & ~(PPC_BITMASK(52, 63));
	rb |= ap << PPC_BITLSHIFT(58);
	rs = pid << PPC_BITLSHIFT(31);
	prs = 1; /* process scoped */
	r = 1;   /* radix format */

	asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
		     : : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(rs) : "memory");
	trace_tlbie(0, 0, rb, rs, ric, prs, r);
}

static inline void fixup_tlbie(void)
{
	unsigned long pid = 0;
	unsigned long va = ((1UL << 52) - 1);

	if (cpu_has_feature(CPU_FTR_P9_TLBIE_BUG)) {
		asm volatile("ptesync": : :"memory");
		__tlbie_va(va, pid, mmu_get_ap(MMU_PAGE_64K), RIC_FLUSH_TLB);
	}
}

/*
 * We use 128 sets in radix mode and 256 sets in HPT mode.
 */
static inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
{
	int set;

	asm volatile("ptesync": : :"memory");

	/*
	 * Flush the first set of the TLB, and if we're doing a RIC_FLUSH_ALL,
	 * also flush the entire Page Walk Cache.
	 */
	__tlbiel_pid(pid, 0, ric);

	/* For PWC, only one flush is needed */
	if (ric == RIC_FLUSH_PWC) {
		asm volatile("ptesync": : :"memory");
		return;
	}

	/* For the remaining sets, just flush the TLB */
	for (set = 1; set < POWER9_TLB_SETS_RADIX ; set++)
		__tlbiel_pid(pid, set, RIC_FLUSH_TLB);

	asm volatile("ptesync": : :"memory");
	asm volatile(PPC_INVALIDATE_ERAT "; isync" : : :"memory");
}

static inline void _tlbie_pid(unsigned long pid, unsigned long ric)
{
	asm volatile("ptesync": : :"memory");

	/*
	 * Work around the fact that the "ric" argument to __tlbie_pid
	 * must be a compile-time constant to match the "i" constraint
	 * in the asm statement.
	 */
	switch (ric) {
	case RIC_FLUSH_TLB:
		__tlbie_pid(pid, RIC_FLUSH_TLB);
		break;
	case RIC_FLUSH_PWC:
		__tlbie_pid(pid, RIC_FLUSH_PWC);
		break;
	case RIC_FLUSH_ALL:
	default:
		__tlbie_pid(pid, RIC_FLUSH_ALL);
	}
	fixup_tlbie();
	asm volatile("eieio; tlbsync; ptesync": : :"memory");
}

powerpc/64s/radix: Optimize flush_tlb_range

Currently for radix, flush_tlb_range flushes the entire PID, because
the Linux mm code does not tell us about page size here for THP vs
regular pages. This is quite sub-optimal for small mremap / mprotect
/ change_protection.

So implement va range flushes with two flush passes, one for each
page size (regular and THP). The second flush has an order of magnitude
fewer tlbie instructions than the first, so it is a relatively small
additional cost.

There is still room for improvement here with some changes to generic
APIs: particularly if there are mostly THP pages to be invalidated,
the small page flushes could be reduced.

Time to mprotect 1 page of memory (after mmap, touch):
vanilla 2.9us 1.8us
patched 1.2us 1.6us

Time to mprotect 30 pages of memory (after mmap, touch):
vanilla 8.2us 7.2us
patched 6.9us 17.9us

Time to mprotect 34 pages of memory (after mmap, touch):
vanilla 9.1us 8.0us
patched 9.0us 8.0us

34 pages is the point at which the invalidation switches from va
to entire PID, which tlbie can do in a single instruction. This is
why in the case of 30 pages, the new code runs slower for this test.
This is a deliberate tradeoff already present in the unmap and THP
promotion code; the idea is that there is a benefit from avoiding
flushing the entire TLB for this PID on all threads in the system.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-07 15:53:07 +08:00
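The two-pass range flush and the switch to a whole-PID flush described above can be modelled by counting how many per-page invalidations each pass would issue. This is a hedged sketch, not the kernel's implementation: the function names are hypothetical, the ceiling of 33 pages is inferred from the "34 pages" remark in the message, and the rounding of the huge-page pass to aligned boundaries is an assumption. The tests use 64K base pages and 2M THP as example sizes.

```c
#include <assert.h>
#include <stdbool.h>

/* Count iterations of "for (addr = start; addr < end; addr += page_size)". */
static unsigned long va_flush_count(unsigned long start, unsigned long end,
				    unsigned long page_size)
{
	unsigned long addr, n = 0;

	assert(page_size != 0);
	for (addr = start; addr < end; addr += page_size)
		n++;
	return n;
}

/* Hypothetical policy: use one whole-PID flush above `ceiling` pages. */
static bool use_pid_flush(unsigned long start, unsigned long end,
			  unsigned int page_shift, unsigned long ceiling)
{
	return ((end - start) >> page_shift) > ceiling;
}

/* Second pass (assumed rule): only huge-page-aligned addresses in range. */
static unsigned long thp_flush_count(unsigned long start, unsigned long end,
				     unsigned long huge_size)
{
	unsigned long hstart = (start + huge_size - 1) & ~(huge_size - 1);
	unsigned long hend = end & ~(huge_size - 1);

	return hstart < hend ? va_flush_count(hstart, hend, huge_size) : 0;
}
```

Under these assumptions, a 30-page range costs 30 small-page invalidations and no huge-page ones, while a 34-page range crosses the ceiling and would be handled by a single whole-PID flush instead.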
static inline void __tlbiel_va_range(unsigned long start, unsigned long end,
				    unsigned long pid, unsigned long page_size,
				    unsigned long psize)
{
	unsigned long addr;
	unsigned long ap = mmu_get_ap(psize);

	for (addr = start; addr < end; addr += page_size)
		__tlbiel_va(addr, pid, ap, RIC_FLUSH_TLB);
}

static inline void _tlbiel_va(unsigned long va, unsigned long pid,
			      unsigned long psize, unsigned long ric)
{
	unsigned long ap = mmu_get_ap(psize);

	asm volatile("ptesync": : :"memory");
	__tlbiel_va(va, pid, ap, ric);
	asm volatile("ptesync": : :"memory");
}

static inline void _tlbiel_va_range(unsigned long start, unsigned long end,
				    unsigned long pid, unsigned long page_size,
				    unsigned long psize, bool also_pwc)
{
	asm volatile("ptesync": : :"memory");
	if (also_pwc)
		__tlbiel_pid(pid, 0, RIC_FLUSH_PWC);
	__tlbiel_va_range(start, end, pid, page_size, psize);
	asm volatile("ptesync": : :"memory");
}

static inline void __tlbie_va_range(unsigned long start, unsigned long end,
				    unsigned long pid, unsigned long page_size,
				    unsigned long psize)
{
	unsigned long addr;
	unsigned long ap = mmu_get_ap(psize);

	for (addr = start; addr < end; addr += page_size)
		__tlbie_va(addr, pid, ap, RIC_FLUSH_TLB);
}

static inline void _tlbie_va(unsigned long va, unsigned long pid,
			     unsigned long psize, unsigned long ric)
{
	unsigned long ap = mmu_get_ap(psize);

	asm volatile("ptesync": : :"memory");
	__tlbie_va(va, pid, ap, ric);
	fixup_tlbie();
	asm volatile("eieio; tlbsync; ptesync": : :"memory");
}

static inline void _tlbie_va_range(unsigned long start, unsigned long end,
				   unsigned long pid, unsigned long page_size,
				   unsigned long psize, bool also_pwc)
{
	asm volatile("ptesync": : :"memory");
	if (also_pwc)
		__tlbie_pid(pid, RIC_FLUSH_PWC);
	__tlbie_va_range(start, end, pid, page_size, psize);
	fixup_tlbie();
	asm volatile("eieio; tlbsync; ptesync": : :"memory");
}

/*
 * Base TLB flushing operations:
 *
 *  - flush_tlb_mm(mm) flushes the TLB entries of the specified mm context
 *  - flush_tlb_page(vma, vmaddr) flushes one page
 *  - flush_tlb_range(vma, start, end) flushes a range of pages
 *  - flush_tlb_kernel_range(start, end) flushes kernel pages
 *
 *  - local_* variants of page and mm only apply to the current
 *    processor
 */
void radix__local_flush_tlb_mm(struct mm_struct *mm)
{
	unsigned long pid;

	preempt_disable();
	pid = mm->context.id;
	if (pid != MMU_NO_CONTEXT)
		_tlbiel_pid(pid, RIC_FLUSH_TLB);
	preempt_enable();
}
EXPORT_SYMBOL(radix__local_flush_tlb_mm);

#ifndef CONFIG_SMP
void radix__local_flush_all_mm(struct mm_struct *mm)
{
	unsigned long pid;

	preempt_disable();
	pid = mm->context.id;
	if (pid != MMU_NO_CONTEXT)
		_tlbiel_pid(pid, RIC_FLUSH_ALL);
	preempt_enable();
}
EXPORT_SYMBOL(radix__local_flush_all_mm);
#endif /* CONFIG_SMP */

void radix__local_flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
				       int psize)
{
	unsigned long pid;

	preempt_disable();
	pid = mm->context.id;
	if (pid != MMU_NO_CONTEXT)
		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
	preempt_enable();
}

void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr)
{
#ifdef CONFIG_HUGETLB_PAGE
	/* need the return fix for nohash.c */
	if (is_vm_hugetlb_page(vma))
		return radix__local_flush_hugetlb_page(vma, vmaddr);
#endif
	radix__local_flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
}
EXPORT_SYMBOL(radix__local_flush_tlb_page);

static bool mm_needs_flush_escalation(struct mm_struct *mm)
{
	/*
	 * P9 nest MMU has issues with the page walk cache
	 * caching PTEs and not flushing them properly when
	 * RIC = 0 for a PID/LPID invalidate
	 */
	return atomic_read(&mm->context.copros) != 0;
}

#ifdef CONFIG_SMP
void radix__flush_tlb_mm(struct mm_struct *mm)
{
	unsigned long pid;

	pid = mm->context.id;
	if (unlikely(pid == MMU_NO_CONTEXT))
		return;

	preempt_disable();
	if (!mm_is_thread_local(mm)) {
		if (mm_needs_flush_escalation(mm))
			_tlbie_pid(pid, RIC_FLUSH_ALL);
		else
			_tlbie_pid(pid, RIC_FLUSH_TLB);
	} else
		_tlbiel_pid(pid, RIC_FLUSH_TLB);
	preempt_enable();
}
EXPORT_SYMBOL(radix__flush_tlb_mm);

void radix__flush_all_mm(struct mm_struct *mm)
{
	unsigned long pid;

	pid = mm->context.id;
	if (unlikely(pid == MMU_NO_CONTEXT))
		return;

	preempt_disable();
	if (!mm_is_thread_local(mm))
		_tlbie_pid(pid, RIC_FLUSH_ALL);
	else
		_tlbiel_pid(pid, RIC_FLUSH_ALL);
	preempt_enable();
}
EXPORT_SYMBOL(radix__flush_all_mm);

void radix__flush_tlb_pwc(struct mmu_gather *tlb, unsigned long addr)
{
	tlb->need_flush_all = 1;
}
EXPORT_SYMBOL(radix__flush_tlb_pwc);

void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr,
				 int psize)
{
	unsigned long pid;

	pid = mm->context.id;
	if (unlikely(pid == MMU_NO_CONTEXT))
		return;

	preempt_disable();
	if (!mm_is_thread_local(mm))
		_tlbie_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
	else
		_tlbiel_va(vmaddr, pid, psize, RIC_FLUSH_TLB);
	preempt_enable();
}

void radix__flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr)
{
#ifdef CONFIG_HUGETLB_PAGE
	if (is_vm_hugetlb_page(vma))
		return radix__flush_hugetlb_page(vma, vmaddr);
#endif
	radix__flush_tlb_page_psize(vma->vm_mm, vmaddr, mmu_virtual_psize);
}
EXPORT_SYMBOL(radix__flush_tlb_page);

#else /* CONFIG_SMP */
#define radix__flush_all_mm radix__local_flush_all_mm
#endif /* CONFIG_SMP */

void radix__flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	_tlbie_pid(0, RIC_FLUSH_ALL);
}
EXPORT_SYMBOL(radix__flush_tlb_kernel_range);

powerpc/64s/radix: Optimize flush_tlb_range
Currently for radix, flush_tlb_range flushes the entire PID, because
the Linux mm code does not tell us about page size here for THP vs
regular pages. This is quite sub-optimal for small mremap / mprotect
/ change_protection.
So implement va range flushes with two flush passes, one for each
page size (regular and THP). The second flush has an order of matnitude
fewer tlbie instructions than the first, so it is a relatively small
additional cost.
There is still room for improvement here with some changes to generic
APIs, particularly if there are mostly THP pages to be invalidated,
the small page flushes could be reduced.
Time to mprotect 1 page of memory (after mmap, touch):
vanilla 2.9us 1.8us
patched 1.2us 1.6us
Time to mprotect 30 pages of memory (after mmap, touch):
vanilla 8.2us 7.2us
patched 6.9us 17.9us
Time to mprotect 34 pages of memory (after mmap, touch):
vanilla 9.1us 8.0us
patched 9.0us 8.0us
34 pages is the point at which the invalidation switches from va
to entire PID, which tlbie can do in a single instruction. This is
why in the case of 30 pages, the new code runs slower for this test.
This is a deliberate tradeoff already present in the unmap and THP
promotion code; the idea is that there is a benefit from avoiding flushing
the entire TLB for this PID on all threads in the system.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-07 15:53:07 +08:00
|
|
|
#define TLB_FLUSH_ALL -1UL
|
|
|
|
|
2016-04-29 21:26:05 +08:00
|
|
|
/*
|
2017-11-07 15:53:07 +08:00
|
|
|
* Number of pages above which we invalidate the entire PID rather than
|
|
|
|
* flush individual pages, for local and global flushes respectively.
|
|
|
|
*
|
|
|
|
* tlbie goes out to the interconnect and individual ops are more costly.
|
|
|
|
* It also does not iterate over sets like the local tlbiel variant when
|
|
|
|
* invalidating a full PID, so it has a far lower threshold to change from
|
|
|
|
* individual page flushes to full-pid flushes.
|
2016-04-29 21:26:05 +08:00
|
|
|
*/
|
2017-11-07 15:53:07 +08:00
|
|
|
static unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
|
powerpc/64s/radix: Introduce local single page ceiling for TLB range flush
The single page flush ceiling is the cut-off point at which we switch
from invalidating individual pages, to invalidating the entire process
address space in response to a range flush.
Introduce a local variant of this heuristic because local and global
tlbie have significantly different properties:
- Local tlbiel requires 128 instructions to invalidate a PID, global
tlbie only 1 instruction.
- Global tlbie instructions are expensive broadcast operations.
The local ceiling has been made much higher, 2x the number of
instructions required to invalidate the entire PID (i.e., 256 pages).
Time to mprotect N pages of memory (after mmap, touch), local invalidate:
N 32 34 64 128 256 512
vanilla 7.4us 9.0us 14.6us 26.4us 50.2us 98.3us
patched 7.4us 7.8us 13.8us 26.4us 51.9us 98.3us
The behaviour of both is identical at N=32 and N=512. Between there,
the vanilla kernel does a PID invalidate and the patched kernel does
a va range invalidate.
At N=128, these require the same number of tlbiel instructions, so
the patched version can be seen to be cheaper when < 128, and more
expensive when > 128. However, this does not capture well the cost
of the invalidated TLB entries.
The additional cost at 256 pages does not seem prohibitive. It may
be the case that increasing the limit further would continue to be
beneficial to avoid invalidating all of the process's TLB entries.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-11-07 15:53:08 +08:00
|
|
|
static unsigned long tlb_local_single_page_flush_ceiling __read_mostly = POWER9_TLB_SETS_RADIX * 2;
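The two ceilings above can be read as one predicate: switch to a full-PID flush when the caller asked for everything, or when the range covers more pages than the ceiling for that flush type. The following is a minimal standalone sketch of that decision, not the kernel code itself. All `sketch_`/`SKETCH_` names are hypothetical, and the local ceiling of 256 assumes `POWER9_TLB_SETS_RADIX` is 128, as the commit message's "128 instructions" and "256 pages" figures suggest.

```c
#include <stdbool.h>

/* Stand-in constants for this sketch; the values mirror the listing. */
#define SKETCH_FLUSH_ALL      (-1UL)
#define SKETCH_GLOBAL_CEILING 33UL   /* tlb_single_page_flush_ceiling */
#define SKETCH_LOCAL_CEILING  256UL  /* assumed: POWER9_TLB_SETS_RADIX (128) * 2 */

/*
 * Return true when a range flush should become a full-PID flush.
 * 'local' says whether the mm has only been used on the current thread.
 */
static bool sketch_want_full_flush(unsigned long start, unsigned long end,
				   unsigned int page_shift, bool local)
{
	unsigned long nr_pages = (end - start) >> page_shift;
	unsigned long ceiling = local ? SKETCH_LOCAL_CEILING
				      : SKETCH_GLOBAL_CEILING;

	return end == SKETCH_FLUSH_ALL || nr_pages > ceiling;
}
```

With 64K pages (shift 16), a 34-page range would take the full-PID path for a global flush but stay a per-page flush when local, which matches the N=34 column in the timing table.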
|
2017-11-07 15:53:07 +08:00
|
|
|
|
2016-04-29 21:26:05 +08:00
|
|
|
void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
|
|
|
|
unsigned long end)
|
|
|
|
|
|
|
|
{
|
|
|
|
struct mm_struct *mm = vma->vm_mm;
|
2017-11-07 15:53:07 +08:00
|
|
|
unsigned long pid;
|
|
|
|
unsigned int page_shift = mmu_psize_defs[mmu_virtual_psize].shift;
|
|
|
|
unsigned long page_size = 1UL << page_shift;
|
|
|
|
unsigned long nr_pages = (end - start) >> page_shift;
|
|
|
|
bool local, full;
|
2017-07-19 12:49:05 +08:00
|
|
|
|
2017-11-07 15:53:07 +08:00
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
|
|
|
if (is_vm_hugetlb_page(vma))
|
|
|
|
return radix__flush_hugetlb_tlb_range(vma, start, end);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
pid = mm->context.id;
|
|
|
|
if (unlikely(pid == MMU_NO_CONTEXT))
|
|
|
|
return;
|
|
|
|
|
|
|
|
preempt_disable();
|
2017-11-07 15:53:08 +08:00
|
|
|
if (mm_is_thread_local(mm)) {
|
|
|
|
local = true;
|
|
|
|
full = (end == TLB_FLUSH_ALL ||
|
|
|
|
nr_pages > tlb_local_single_page_flush_ceiling);
|
|
|
|
} else {
|
|
|
|
local = false;
|
|
|
|
full = (end == TLB_FLUSH_ALL ||
|
|
|
|
nr_pages > tlb_single_page_flush_ceiling);
|
|
|
|
}
|
2017-11-07 15:53:07 +08:00
|
|
|
|
|
|
|
if (full) {
|
2018-03-23 06:29:06 +08:00
|
|
|
if (local) {
|
2017-11-07 15:53:07 +08:00
|
|
|
_tlbiel_pid(pid, RIC_FLUSH_TLB);
|
2018-03-23 06:29:06 +08:00
|
|
|
} else {
|
|
|
|
if (mm_needs_flush_escalation(mm))
|
|
|
|
_tlbie_pid(pid, RIC_FLUSH_ALL);
|
|
|
|
else
|
|
|
|
_tlbie_pid(pid, RIC_FLUSH_TLB);
|
|
|
|
}
|
2017-11-07 15:53:07 +08:00
|
|
|
} else {
|
|
|
|
bool hflush = false;
|
|
|
|
unsigned long hstart, hend;
|
|
|
|
|
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
hstart = (start + HPAGE_PMD_SIZE - 1) >> HPAGE_PMD_SHIFT;
|
|
|
|
hend = end >> HPAGE_PMD_SHIFT;
|
|
|
|
if (hstart < hend) {
|
|
|
|
hstart <<= HPAGE_PMD_SHIFT;
|
|
|
|
hend <<= HPAGE_PMD_SHIFT;
|
|
|
|
hflush = true;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
asm volatile("ptesync": : :"memory");
|
|
|
|
if (local) {
|
|
|
|
__tlbiel_va_range(start, end, pid, page_size, mmu_virtual_psize);
|
|
|
|
if (hflush)
|
|
|
|
__tlbiel_va_range(hstart, hend, pid,
|
|
|
|
HPAGE_PMD_SIZE, MMU_PAGE_2M);
|
|
|
|
asm volatile("ptesync": : :"memory");
|
|
|
|
} else {
|
|
|
|
__tlbie_va_range(start, end, pid, page_size, mmu_virtual_psize);
|
|
|
|
if (hflush)
|
|
|
|
__tlbie_va_range(hstart, hend, pid,
|
|
|
|
HPAGE_PMD_SIZE, MMU_PAGE_2M);
|
2018-03-23 12:56:27 +08:00
|
|
|
fixup_tlbie();
|
2017-11-07 15:53:07 +08:00
|
|
|
asm volatile("eieio; tlbsync; ptesync": : :"memory");
|
|
|
|
}
|
|
|
|
}
|
|
|
|
preempt_enable();
|
2016-04-29 21:26:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(radix__flush_tlb_range);
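The second (THP) pass in radix__flush_tlb_range only covers the part of the range that could hold whole huge pages: the start is rounded up and the end rounded down to a huge-page boundary. A minimal standalone sketch of that arithmetic follows. The `sketch_` names are hypothetical, and the 2 MB huge page size (shift 21) is an assumption for illustration; the listing uses HPAGE_PMD_SHIFT and HPAGE_PMD_SIZE.

```c
#include <stdbool.h>

/* Assumed 2 MB huge page for illustration; the listing uses HPAGE_PMD_SHIFT. */
#define SKETCH_HPAGE_SHIFT 21
#define SKETCH_HPAGE_SIZE  (1UL << SKETCH_HPAGE_SHIFT)

/*
 * Find the huge-page-aligned sub-range [*hstart, *hend) inside [start, end).
 * Returns false (leaving the outputs untouched) when the range does not
 * contain a single fully aligned huge page.
 */
static bool sketch_thp_subrange(unsigned long start, unsigned long end,
				unsigned long *hstart, unsigned long *hend)
{
	/* Round start up and end down, both in units of huge pages. */
	unsigned long hs = (start + SKETCH_HPAGE_SIZE - 1) >> SKETCH_HPAGE_SHIFT;
	unsigned long he = end >> SKETCH_HPAGE_SHIFT;

	if (hs >= he)
		return false;

	*hstart = hs << SKETCH_HPAGE_SHIFT;
	*hend = he << SKETCH_HPAGE_SHIFT;
	return true;
}
```

For a range of 0x100000 to 0x500000 this yields 0x200000 to 0x400000, i.e. one huge-page invalidation in the second pass, while a range that stays inside a single 2 MB region produces no second pass at all.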
|
|
|
|
|
2016-07-13 17:35:29 +08:00
|
|
|
static int radix_get_mmu_psize(int page_size)
|
|
|
|
{
|
|
|
|
int psize;
|
|
|
|
|
|
|
|
if (page_size == (1UL << mmu_psize_defs[mmu_virtual_psize].shift))
|
|
|
|
psize = mmu_virtual_psize;
|
|
|
|
else if (page_size == (1UL << mmu_psize_defs[MMU_PAGE_2M].shift))
|
|
|
|
psize = MMU_PAGE_2M;
|
|
|
|
else if (page_size == (1UL << mmu_psize_defs[MMU_PAGE_1G].shift))
|
|
|
|
psize = MMU_PAGE_1G;
|
|
|
|
else
|
|
|
|
return -1;
|
|
|
|
return psize;
|
|
|
|
}
|
2016-04-29 21:26:05 +08:00
|
|
|
|
2017-11-07 15:53:09 +08:00
|
|
|
static void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
|
|
|
|
unsigned long end, int psize);
|
|
|
|
|
2016-04-29 21:26:05 +08:00
|
|
|
void radix__tlb_flush(struct mmu_gather *tlb)
|
|
|
|
{
|
2016-07-13 17:36:35 +08:00
|
|
|
int psize = 0;
|
2016-04-29 21:26:05 +08:00
|
|
|
struct mm_struct *mm = tlb->mm;
|
2016-07-13 17:36:35 +08:00
|
|
|
int page_size = tlb->page_size;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if page size is not something we understand, do a full mm flush
|
2017-10-24 21:06:54 +08:00
|
|
|
*
|
|
|
|
* A "fullmm" flush must always do a flush_all_mm (RIC=2) flush
|
|
|
|
* that flushes the process table entry cache upon process teardown.
|
|
|
|
* See the comment for radix in arch_exit_mmap().
|
2016-07-13 17:36:35 +08:00
|
|
|
*/
|
2017-11-07 15:53:09 +08:00
|
|
|
if (tlb->fullmm) {
|
2017-07-19 12:49:05 +08:00
|
|
|
radix__flush_all_mm(mm);
|
2017-11-07 15:53:09 +08:00
|
|
|
} else if ((psize = radix_get_mmu_psize(page_size)) == -1) {
|
|
|
|
if (!tlb->need_flush_all)
|
|
|
|
radix__flush_tlb_mm(mm);
|
|
|
|
else
|
|
|
|
radix__flush_all_mm(mm);
|
|
|
|
} else {
|
|
|
|
unsigned long start = tlb->start;
|
|
|
|
unsigned long end = tlb->end;
|
|
|
|
|
|
|
|
if (!tlb->need_flush_all)
|
|
|
|
radix__flush_tlb_range_psize(mm, start, end, psize);
|
|
|
|
else
|
|
|
|
radix__flush_tlb_pwc_range_psize(mm, start, end, psize);
|
|
|
|
}
|
|
|
|
tlb->need_flush_all = 0;
|
2016-07-13 17:36:35 +08:00
|
|
|
}
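The branch structure of radix__tlb_flush can be summarised as a small decision function over three inputs: whether this is a full-mm teardown, whether the page size was recognised, and whether need_flush_all was set. The sketch below is a standalone restatement for illustration only; the enum and function names are hypothetical, and psize is passed in rather than computed.

```c
#include <stdbool.h>

/* Hypothetical labels for the four flush paths taken in the listing. */
enum sketch_flush_kind {
	SKETCH_ALL_MM,     /* radix__flush_all_mm() */
	SKETCH_TLB_MM,     /* radix__flush_tlb_mm() */
	SKETCH_RANGE,      /* radix__flush_tlb_range_psize() */
	SKETCH_RANGE_PWC,  /* radix__flush_tlb_pwc_range_psize() */
};

/* psize == -1 stands for "page size not understood", as in the listing. */
static enum sketch_flush_kind sketch_pick_flush(bool fullmm, int psize,
						bool need_flush_all)
{
	if (fullmm)
		return SKETCH_ALL_MM;
	if (psize == -1)
		return need_flush_all ? SKETCH_ALL_MM : SKETCH_TLB_MM;
	return need_flush_all ? SKETCH_RANGE_PWC : SKETCH_RANGE;
}
```

Read this way, need_flush_all never changes how much of the address space is flushed; it only decides whether the page walk cache is flushed along with the TLB entries.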
|
|
|
|
|
2017-11-07 15:53:09 +08:00
|
|
|
static inline void __radix__flush_tlb_range_psize(struct mm_struct *mm,
|
|
|
|
unsigned long start, unsigned long end,
|
|
|
|
int psize, bool also_pwc)
|
2016-07-13 17:36:35 +08:00
|
|
|
{
|
|
|
|
unsigned long pid;
|
2017-11-07 15:53:07 +08:00
|
|
|
unsigned int page_shift = mmu_psize_defs[psize].shift;
|
|
|
|
unsigned long page_size = 1UL << page_shift;
|
|
|
|
unsigned long nr_pages = (end - start) >> page_shift;
|
|
|
|
bool local, full;
|
2016-07-13 17:36:35 +08:00
|
|
|
|
2017-10-16 15:11:00 +08:00
|
|
|
pid = mm->context.id;
|
2016-07-13 17:36:35 +08:00
|
|
|
if (unlikely(pid == MMU_NO_CONTEXT))
|
2017-10-24 21:06:53 +08:00
|
|
|
return;
|
2016-07-13 17:36:35 +08:00
|
|
|
|
2017-10-24 21:06:53 +08:00
|
|
|
preempt_disable();
|
2017-11-07 15:53:08 +08:00
|
|
|
if (mm_is_thread_local(mm)) {
|
|
|
|
local = true;
|
|
|
|
full = (end == TLB_FLUSH_ALL ||
|
|
|
|
nr_pages > tlb_local_single_page_flush_ceiling);
|
|
|
|
} else {
|
|
|
|
local = false;
|
|
|
|
full = (end == TLB_FLUSH_ALL ||
|
|
|
|
nr_pages > tlb_single_page_flush_ceiling);
|
|
|
|
}
|
2017-11-07 15:53:07 +08:00
|
|
|
|
|
|
|
if (full) {
|
2018-03-23 06:29:06 +08:00
|
|
|
if (!local && mm_needs_flush_escalation(mm))
|
|
|
|
also_pwc = true;
|
|
|
|
|
2016-07-13 17:36:35 +08:00
|
|
|
if (local)
|
2017-11-07 15:53:09 +08:00
|
|
|
_tlbiel_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
|
2016-07-13 17:36:35 +08:00
|
|
|
else
|
2017-11-07 15:53:09 +08:00
|
|
|
_tlbie_pid(pid, also_pwc ? RIC_FLUSH_ALL : RIC_FLUSH_TLB);
|
2017-10-24 21:06:53 +08:00
|
|
|
} else {
|
2017-11-07 15:53:05 +08:00
|
|
|
if (local)
|
2017-11-07 15:53:09 +08:00
|
|
|
_tlbiel_va_range(start, end, pid, page_size, psize, also_pwc);
|
2017-11-07 15:53:05 +08:00
|
|
|
else
|
2017-11-07 15:53:09 +08:00
|
|
|
_tlbie_va_range(start, end, pid, page_size, psize, also_pwc);
|
2016-07-13 17:36:35 +08:00
|
|
|
}
|
|
|
|
preempt_enable();
|
2016-04-29 21:26:05 +08:00
|
|
|
}
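In the full-flush branch of __radix__flush_tlb_range_psize, the RIC value depends on also_pwc, which is itself forced on for a global flush when mm_needs_flush_escalation() reports true. A minimal standalone sketch of that selection follows. The names are hypothetical; the value 2 for a flush-all comes from the "RIC=2" comment in the listing, and the other enum values are assumptions for illustration.

```c
#include <stdbool.h>

/*
 * Stand-in RIC values. Only "flush all is RIC=2" is stated in the listing;
 * the TLB and PWC values here are assumed.
 */
enum sketch_ric {
	SKETCH_RIC_TLB = 0,
	SKETCH_RIC_PWC = 1,
	SKETCH_RIC_ALL = 2,
};

/*
 * Pick the RIC for a full-PID flush. 'needs_escalation' stands in for the
 * result of mm_needs_flush_escalation(mm).
 */
static enum sketch_ric sketch_full_flush_ric(bool local, bool needs_escalation,
					     bool also_pwc)
{
	/* Escalation only applies to global (broadcast) flushes. */
	if (!local && needs_escalation)
		also_pwc = true;

	return also_pwc ? SKETCH_RIC_ALL : SKETCH_RIC_TLB;
}
```

A local flush therefore stays at the TLB-only level unless the caller explicitly asked for the page walk cache to be flushed as well.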
|
2016-07-13 17:35:29 +08:00
|
|
|
|
2017-11-07 15:53:09 +08:00
|
|
|
void radix__flush_tlb_range_psize(struct mm_struct *mm, unsigned long start,
|
|
|
|
unsigned long end, int psize)
|
|
|
|
{
|
|
|
|
__radix__flush_tlb_range_psize(mm, start, end, psize, false);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void radix__flush_tlb_pwc_range_psize(struct mm_struct *mm, unsigned long start,
|
|
|
|
unsigned long end, int psize)
|
|
|
|
{
|
|
|
|
__radix__flush_tlb_range_psize(mm, start, end, psize, true);
|
|
|
|
}
|
|
|
|
|
2017-07-19 12:49:06 +08:00
|
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
|
|
void radix__flush_tlb_collapsed_pmd(struct mm_struct *mm, unsigned long addr)
|
|
|
|
{
|
|
|
|
unsigned long pid, end;
|
|
|
|
|
2017-10-16 15:11:00 +08:00
|
|
|
pid = mm->context.id;
|
2017-07-19 12:49:06 +08:00
|
|
|
if (unlikely(pid == MMU_NO_CONTEXT))
|
2017-10-24 21:06:53 +08:00
|
|
|
return;
|
2017-07-19 12:49:06 +08:00
|
|
|
|
|
|
|
/* 4k page size, just blow the world */
|
|
|
|
if (PAGE_SIZE == 0x1000) {
|
|
|
|
radix__flush_all_mm(mm);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-11-07 15:53:07 +08:00
|
|
|
end = addr + HPAGE_PMD_SIZE;
|
|
|
|
|
|
|
|
/* Otherwise first do the PWC, then iterate the pages. */
|
2017-10-24 21:06:53 +08:00
|
|
|
preempt_disable();
|
2017-07-19 12:49:06 +08:00
|
|
|
|
2017-11-07 15:53:07 +08:00
|
|
|
if (mm_is_thread_local(mm)) {
|
2017-11-07 15:53:09 +08:00
|
|
|
_tlbiel_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
|
2017-11-07 15:53:07 +08:00
|
|
|
} else {
|
2017-11-07 15:53:09 +08:00
|
|
|
_tlbie_va_range(addr, end, pid, PAGE_SIZE, mmu_virtual_psize, true);
|
2017-11-07 15:53:07 +08:00
|
|
|
}
|
2017-11-07 15:53:05 +08:00
|
|
|
|
2017-07-19 12:49:06 +08:00
|
|
|
preempt_enable();
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
|
|
|
|
2016-07-13 17:36:40 +08:00
|
|
|
void radix__flush_pmd_tlb_range(struct vm_area_struct *vma,
|
|
|
|
unsigned long start, unsigned long end)
|
|
|
|
{
|
|
|
|
radix__flush_tlb_range_psize(vma->vm_mm, start, end, MMU_PAGE_2M);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(radix__flush_pmd_tlb_range);
|
2016-08-23 18:57:48 +08:00
|
|
|
|
|
|
|
void radix__flush_tlb_all(void)
|
|
|
|
{
|
|
|
|
unsigned long rb, prs, r, rs;
|
|
|
|
unsigned long ric = RIC_FLUSH_ALL;
|
|
|
|
|
|
|
|
rb = 0x3 << PPC_BITLSHIFT(53); /* IS = 3 */
|
|
|
|
prs = 0; /* partition scoped */
|
2018-02-01 13:07:25 +08:00
|
|
|
r = 1; /* radix format */
|
2016-08-23 18:57:48 +08:00
|
|
|
rs = 1 & ((1UL << 32) - 1); /* any LPID value to flush guest mappings */
|
|
|
|
|
|
|
|
asm volatile("ptesync": : :"memory");
|
|
|
|
/*
|
|
|
|
* now flush guest entries by passing PRS = 1 and LPID != 0
|
|
|
|
*/
|
|
|
|
asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
|
|
|
|
: : "r"(rb), "i"(r), "i"(1), "i"(ric), "r"(rs) : "memory");
|
|
|
|
/*
|
|
|
|
* now flush host entries by passing PRS = 0 and LPID == 0
|
|
|
|
*/
|
|
|
|
asm volatile(PPC_TLBIE_5(%0, %4, %3, %2, %1)
|
|
|
|
: : "r"(rb), "i"(r), "i"(prs), "i"(ric), "r"(0) : "memory");
|
|
|
|
asm volatile("eieio; tlbsync; ptesync": : :"memory");
|
|
|
|
}
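The `rb` value built in radix__flush_tlb_all() relies on IBM bit numbering, where bit 0 is the most significant bit of the register. A minimal sketch of that arithmetic follows, assuming a 64-bit register and assuming that PPC_BITLSHIFT(be) expands to BITS_PER_LONG - 1 - (be); the `SKETCH_*` macros and `tlbie_rb_is` are hypothetical names used only for this illustration.

```c
#include <stdint.h>

/* Assumed: 64-bit registers, IBM numbering (bit 0 is the MSB). */
#define SKETCH_BITS_PER_LONG 64
#define SKETCH_PPC_BITLSHIFT(be) (SKETCH_BITS_PER_LONG - 1 - (be))

/*
 * Place a 2-bit IS (invalidation selector) value so that its low bit
 * lands on IBM bit 53, mirroring "0x3 << PPC_BITLSHIFT(53)" above.
 */
static uint64_t tlbie_rb_is(unsigned int is)
{
	return (uint64_t)(is & 0x3) << SKETCH_PPC_BITLSHIFT(53);
}
```

With these assumptions the shift amount is 10, so IS = 3 corresponds to the value 0xc00 in conventional (LSB = bit 0) notation.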
|
2016-11-28 14:17:01 +08:00
|
|
|
|
|
|
|
void radix__flush_tlb_pte_p9_dd1(unsigned long old_pte, struct mm_struct *mm,
|
|
|
|
unsigned long address)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We track page size in pte only for DD1, so we can
|
|
|
|
* call this only on DD1.
|
|
|
|
*/
|
|
|
|
if (!cpu_has_feature(CPU_FTR_POWER9_DD1)) {
|
|
|
|
VM_WARN_ON(1);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2017-03-22 01:29:54 +08:00
|
|
|
if (old_pte & R_PAGE_LARGE)
|
2016-11-28 14:17:01 +08:00
|
|
|
radix__flush_tlb_page_psize(mm, address, MMU_PAGE_2M);
|
|
|
|
else
|
|
|
|
radix__flush_tlb_page_psize(mm, address, mmu_virtual_psize);
|
|
|
|
}
|
powerpc/mm/radix: Workaround prefetch issue with KVM
There's a somewhat architectural issue with Radix MMU and KVM.
When coming out of a guest with AIL (Alternate Interrupt Location, ie,
MMU enabled), we start executing hypervisor code with the PID register
still containing whatever the guest has been using.
The problem is that the CPU can (and will) then start prefetching or
speculatively load from whatever host context has that same PID (if
any), thus bringing translations for that context into the TLB, which
Linux doesn't know about.
This can cause stale translations and subsequent crashes.
Fixing this in a way that is neither racy nor a huge performance
impact is difficult. We could just make the host invalidations always
use broadcast forms but that would hurt single threaded programs for
example.
We chose to fix it instead by partitioning the PID space between guest
and host. This is possible because today Linux only uses 19 out of the
20 bits of PID space, so existing guests will work if we make the host
use the top half of the 20-bit space.
We additionally add support for a property to indicate to Linux the
size of the PID register which will be useful if we eventually have
processors with a larger PID space available.
There is still an issue with malicious guests purposefully setting the
PID register to a value in the host's PID range. Hopefully future HW
can prevent that, but in the meantime, we handle it with a pair of
kludges:
- On the way out of a guest, before we clear the current VCPU in the
PACA, we check the PID and if it's outside of the permitted range
we flush the TLB for that PID.
- When context switching, if the mm is "new" on that CPU (the
corresponding bit was set for the first time in the mm cpumask), we
check if any sibling thread is in KVM (has a non-NULL VCPU pointer
in the PACA). If that is the case, we also flush the PID for that
CPU (core).
This second part is needed to handle the case where a process is
migrated (or starts a new pthread) on a sibling thread of the CPU
coming out of KVM, as there's a window where stale translations can
exist before we detect it and flush them out.
A future optimization could be added by keeping track of whether the
PID has ever been used and avoid doing that for completely fresh PIDs.
We could similarly mark PIDs that have been the subject of a global
invalidation as "fresh". But for now this will do.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
unneeded include of kvm_book3s_asm.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
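The PID partitioning described in this commit message can be illustrated with a few helpers. This is a sketch under stated assumptions, not the kernel implementation: the function names are hypothetical, and the split (guests in the bottom half of the PID space, host in the top half) is taken directly from the text above.

```c
#include <stdbool.h>

/* First PID that belongs to the host: the start of the top half. */
static unsigned int guest_pid_limit(unsigned int pid_bits)
{
	return 1u << (pid_bits - 1);
}

/* True if pid falls in the half of the PID space reserved for the host. */
static bool pid_in_host_range(unsigned int pid, unsigned int pid_bits)
{
	return pid >= guest_pid_limit(pid_bits) && pid < (1u << pid_bits);
}

/*
 * First kludge from the commit message: on guest exit, a PID outside
 * the range permitted to guests means the TLB must be flushed for it.
 */
static bool guest_exit_needs_flush(unsigned int guest_pid, unsigned int pid_bits)
{
	return guest_pid >= guest_pid_limit(pid_bits);
}
```

For a 20-bit PID register, guest_pid_limit(20) is 524288 (0x80000), so a well-behaved guest that uses only 19 bits never reaches the host range.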
2017-07-24 12:26:06 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
|
|
|
|
extern void radix_kvm_prefetch_workaround(struct mm_struct *mm)
|
|
|
|
{
|
|
|
|
unsigned int pid = mm->context.id;
|
|
|
|
|
|
|
|
if (unlikely(pid == MMU_NO_CONTEXT))
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If this context hasn't run on that CPU before and KVM is
|
|
|
|
* around, there's a slim chance that the guest on another
|
|
|
|
* CPU just brought obsolete translations into the TLB of
|
|
|
|
* this CPU due to a bad prefetch using the guest PID on
|
|
|
|
* the way into the hypervisor.
|
|
|
|
*
|
|
|
|
* We work around this here. If KVM is possible, we check if
|
|
|
|
* any sibling thread is in KVM. If it is, the window may exist
|
|
|
|
* and thus we flush that PID from the core.
|
|
|
|
*
|
|
|
|
* A potential future improvement would be to mark which PIDs
|
|
|
|
* have never been used on the system and avoid it if the PID
|
|
|
|
* is new and the process has no other cpumask bit set.
|
|
|
|
*/
|
|
|
|
if (cpu_has_feature(CPU_FTR_HVMODE) && radix_enabled()) {
|
|
|
|
int cpu = smp_processor_id();
|
|
|
|
int sib = cpu_first_thread_sibling(cpu);
|
|
|
|
bool flush = false;
|
|
|
|
|
|
|
|
for (; sib <= cpu_last_thread_sibling(cpu) && !flush; sib++) {
|
|
|
|
if (sib == cpu)
|
|
|
|
continue;
|
|
|
|
if (paca[sib].kvm_hstate.kvm_vcpu)
|
|
|
|
flush = true;
|
|
|
|
}
|
|
|
|
if (flush)
|
|
|
|
_tlbiel_pid(pid, RIC_FLUSH_ALL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(radix_kvm_prefetch_workaround);
|
|
|
|
#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
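The sibling scan in radix_kvm_prefetch_workaround() can be modelled outside the kernel as follows. This is a simplified sketch: the `sketch_vcpu` array stands in for the per-CPU PACA `kvm_vcpu` pointers, the thread count per core is assumed to be a power of two, and all names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_THREADS_PER_CORE 4 /* assumed; must be a power of two */
#define SKETCH_NR_CPUS          8

/* Non-NULL entry: that hardware thread is currently running a guest. */
static const void *sketch_vcpu[SKETCH_NR_CPUS];

/* First hardware thread of the core that cpu belongs to. */
static int first_sibling(int cpu)
{
	return cpu & ~(SKETCH_THREADS_PER_CORE - 1);
}

/* Last hardware thread of the core that cpu belongs to. */
static int last_sibling(int cpu)
{
	return first_sibling(cpu) + SKETCH_THREADS_PER_CORE - 1;
}

/* True if any other thread on the same core is in a guest. */
static bool sibling_in_guest(int cpu)
{
	int sib;

	for (sib = first_sibling(cpu); sib <= last_sibling(cpu); sib++) {
		if (sib != cpu && sketch_vcpu[sib] != NULL)
			return true;
	}
	return false;
}
```

When sibling_in_guest() returns true for the current CPU, the workaround above would flush the PID locally; threads on other cores do not influence the decision.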
|