This frees a PTE bit when using 64K pages on ppc64. This is done
by getting rid of the separate _PAGE_HASHPTE bit. Instead, we just test
if any of the 16 sub-page bits is set. For non-combo pages (ie. real
64K pages), we set SUB0 and the location encoding in that field.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When we demote a slice from 64k to 4k, and we are about to insert an
HPTE for a 4k subpage and we notice that there is an existing 64k
HPTE, we first invalidate that HPTE before inserting the new 4k
subpage HPTE. Since the bits that encode which hash bucket the old
HPTE was in overlap with the bits that encode which of the 16 subpages
have HPTEs, we need to clear out the subpage HPTE-present bits before
starting to insert HPTEs for the 4k subpages. If we don't do that, we
can erroneously think that a subpage already has an HPTE when it
doesn't.
That in itself wouldn't be such a problem except that when we go to
update the HPTE that we think is present on machines with a
hypervisor, the hypervisor can tell us that the HPTE we think is there
is actually there even though it isn't, which can lead to a process
getting stuck in a loop, continually faulting. The reason for the
confusion is that the AVPN (abbreviated virtual page number) we are
looking for in the HPTE for a 4k subpage can actually match the AVPN
in a stale HPTE for another 64k page. For example, the HPTE for
the 4k subpage at 0x84000f000 will be in the same hash bucket and have
the same AVPN as the HPTE for the 64k page at 0x8400f0000.
This fixes the code to clear out the subpage HPTE-present bits.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Using 64k pages on 64-bit PowerPC systems makes life difficult for
emulators that are trying to emulate an ISA, such as x86, which use a
smaller page size, since the emulator can no longer use the MMU and
the normal system calls for controlling page protections. Of course,
the emulator can emulate the MMU by checking and possibly remapping
the address for each memory access in software, but that is pretty
slow.
This provides a facility for such programs to control the access
permissions on individual 4k sub-pages of 64k pages. The idea is
that the emulator supplies an array of protection masks to apply to a
specified range of virtual addresses. These masks are applied at the
level where hardware PTEs are inserted into the hardware page table
based on the Linux PTEs, so the Linux PTEs are not affected. Note
that this new mechanism does not allow any access that would otherwise
be prohibited; it can only prohibit accesses that would otherwise be
allowed. This new facility is only available on 64-bit PowerPC and
only when the kernel is configured for 64k pages.
The masks are supplied using a new subpage_prot system call, which
takes a starting virtual address and length, and a pointer to an array
of protection masks in memory. The array has a 32-bit word per 64k
page to be protected; each 32-bit word consists of 16 2-bit fields,
for which 0 allows any access (that is otherwise allowed), 1 prevents
write accesses, and 2 or 3 prevent any access.
Implicit in this is that the regions of the address space that are
protected are switched to use 4k hardware pages rather than 64k
hardware pages (on machines with hardware 64k page support). In fact
the whole process is switched to use 4k hardware pages when the
subpage_prot system call is used, but this could be improved in future
to switch only the affected segments.
The subpage protection bits are stored in a 3 level tree akin to the
page table tree. The top level of this tree is stored in a structure
that is appended to the top level of the page table tree, i.e., the
pgd array. Since it will often only be 32-bit addresses (below 4GB)
that are protected, the pointers to the first four bottom level pages
are also stored in this structure (each bottom level page contains the
protection bits for 1GB of address space), so the protection bits for
addresses below 4GB can be accessed with one fewer loads than those
for higher addresses.
Signed-off-by: Paul Mackerras <paulus@samba.org>
When demoting a process to use 4K HW pages (instead of 64K), which
happens under various circumstances such as doing cache inhibited
mappings on machines that do not support 64K CI pages, the assembly
hash code calls back into the C function flush_hash_page(). This
function prototype was recently changed to accomodate for 1T segments
but the assembly call site was not updated, causing applications that
do demotion to hang. In addition, when updating the per-CPU PACA for
the new sizes, we didn't properly update the slice "map", thus causing
the SLB miss code to re-insert segments for the wrong size.
This fixes both and adds a warning comment next to the C
implementation to try to avoid problems next time someone changes it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This makes the kernel use 1TB segments for all kernel mappings and for
user addresses of 1TB and above, on machines which support them
(currently POWER5+, POWER6 and PA6T).
We detect that the machine supports 1TB segments by looking at the
ibm,processor-segment-sizes property in the device tree.
We don't currently use 1TB segments for user addresses < 1T, since
that would effectively prevent 32-bit processes from using huge pages
unless we also had a way to revert to using 256MB segments. That
would be possible but would involve extra complications (such as
keeping track of which segment size was used when HPTEs were inserted)
and is not addressed here.
Parts of this patch were originally written by Ben Herrenschmidt.
Signed-off-by: Paul Mackerras <paulus@samba.org>
The code for mapping special 4k pages on kernels using a 64kB base
page size was missing the code for doing the RPN (real page number)
manipulation when inserting the hardware PTE in the secondary hash
bucket. It needs the same code as has already been added to the
code that inserts the HPTE in the primary hash bucket. This adds it.
Spotted by Ben Herrenschmidt.
Signed-off-by: Paul Mackerras <paulus@samba.org>
This adds the ability for a kernel compiled with 4K page size
to have special slices containing 64K pages and hash the right type
of hash PTEs.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some drivers have resources that they want to be able to map into
userspace that are 4k in size. On a kernel configured with 64k pages
we currently end up mapping the 4k we want plus another 60k of
physical address space, which could contain anything. This can
introduce security problems, for example in the case of an infiniband
adaptor where the other 60k could contain registers that some other
program is using for its communications.
This patch adds a new function, remap_4k_pfn, which drivers can use to
map a single 4k page to userspace regardless of whether the kernel is
using a 4k or a 64k page size. Like remap_pfn_range, it would
typically be called in a driver's mmap function. It only maps a
single 4k page, which on a 64k page kernel appears replicated 16 times
throughout a 64k page. On a 4k page kernel it reduces to a call to
remap_pfn_range.
The way this works on a 64k kernel is that a new bit, _PAGE_4K_PFN,
gets set on the linux PTE. This alters the way that __hash_page_4K
computes the real address to put in the HPTE. The RPN field of the
linux PTE becomes the 4k RPN directly rather than being interpreted as
a 64k RPN. Since the RPN field is 32 bits, this means that physical
addresses being mapped with remap_4k_pfn have to be below 2^44,
i.e. 0x100000000000.
The patch also factors out the code in arch/powerpc/mm/hash_utils_64.c
that deals with demoting a process to use 4k pages into one function
that gets called in the various different places where we need to do
that. There were some discrepancies between exactly what was done in
the various places, such as a call to spu_flush_all_slbs in one case
but not in others.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Some POWER5+ machines can do 64k hardware pages for normal memory but
not for cache-inhibited pages. This patch lets us use 64k hardware
pages for most user processes on such machines (assuming the kernel
has been configured with CONFIG_PPC_64K_PAGES=y). User processes
start out using 64k pages and get switched to 4k pages if they use any
non-cacheable mappings.
With this, we use 64k pages for the vmalloc region and 4k pages for
the imalloc region. If anything creates a non-cacheable mapping in
the vmalloc region, the vmalloc region will get switched to 4k pages.
I don't know of any driver other than the DRM that would do this,
though, and these machines don't have AGP.
When a region gets switched from 64k pages to 4k pages, we do not have
to clear out all the 64k HPTEs from the hash table immediately. We
use the _PAGE_COMBO bit in the Linux PTE to indicate whether the page
was hashed in as a 64k page or a set of 4k pages. If hash_page is
trying to insert a 4k page for a Linux PTE and it sees that it has
already been inserted as a 64k page, it first invalidates the 64k HPTE
before inserting the 4k HPTE. The hash invalidation routines also use
the _PAGE_COMBO bit, to determine whether to look for a 64k HPTE or a
set of 4k HPTEs to remove. With those two changes, we can tolerate a
mix of 4k and 64k HPTEs in the hash table, and they will all get
removed when the address space is torn down.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Our MMU hash management code would not set the "C" bit (changed bit) in
the hardware PTE when updating a RO PTE into a RW PTE. That would cause
the hardware to possibly to a write back to the hash table to set it on
the first store access, which in addition to being a performance issue,
might also hit a bug when running with native hash management (non-HV)
as our code is specifically optimized for the case where no write back
happens.
Thus there is a very small therocial window were a hash PTE can become
corrupted if that HPTE has just been upgraded to read write, a store
access happens on it, and that races with another processor evicting
that same slot. Since eviction (caused by an almost full hash) is
extremely rare, the bug is very unlikely to happen fortunately.
This fixes by allowing the updating of the protection bits in the native
hash handling to also set (but not clear) the "C" bit, and, in order to
also improve performances in the general case, by always setting that
bit on newly inserted hash PTE so that writeback really never happens.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Adds a new CONFIG_PPC_64K_PAGES which, when enabled, changes the kernel
base page size to 64K. The resulting kernel still boots on any
hardware. On current machines with 4K pages support only, the kernel
will maintain 16 "subpages" for each 64K page transparently.
Note that while real 64K capable HW has been tested, the current patch
will not enable it yet as such hardware is not released yet, and I'm
still verifying with the firmware architects the proper to get the
information from the newer hypervisors.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This moves the remaining files in arch/ppc64/mm to arch/powerpc/mm,
and arranges that we use them when compiling with ARCH=ppc64.
Signed-off-by: Paul Mackerras <paulus@samba.org>