It overflows the space available in the initial .text section of the
trap handler assembler in some configurations, resulting in build
failures.
Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support for a 16GB hugepage size. To use this page size,
specify kernel parameters as:
default_hugepagesz=16G hugepagesz=16G hugepages=10
Testing:
Tested with the stream benchmark, which allocates 48G of
arrays backed by 16G hugepages and performs read/write
operations on them in parallel.
Orabug: 25362942
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
get_user_pages() is used to do direct IO. It already
handles the case where the address range is backed
by PMD huge pages. This patch now adds the case where
the range could be backed by PUD huge pages.
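A minimal sketch of the walker extension, following the kernel's fast-GUP
conventions (gup_huge_pud() and pud_large() are assumed to be the PUD
analogues of the existing PMD helpers):

    static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
                             int write, struct page **pages, int *nr)
    {
            pud_t *pudp = pud_offset(&pgd, addr);
            unsigned long next;

            do {
                    pud_t pud = *pudp;

                    next = pud_addr_end(addr, end);
                    if (pud_none(pud))
                            return 0;
                    if (unlikely(pud_large(pud))) {
                            /* New case: range backed by a PUD huge page. */
                            if (!gup_huge_pud(pudp, pud, addr, next,
                                              write, pages, nr))
                                    return 0;
                    } else if (!gup_pmd_range(pud, addr, next, write,
                                              pages, nr))
                            return 0;
            } while (pudp++, addr = next, addr != end);

            return 1;
    }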
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If an architecture uses 4level-fixup.h we don't need to do anything as
it includes 5level-fixup.h.
If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before including the header. This makes the asm-generic code use
5level-fixup.h.
If an architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.
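For example, the pgtable-nop*d.h case reduces to something like:

    /* in an arch's asm/pgtable.h that folds the pud level */
    #define __ARCH_USE_5LEVEL_HACK
    #include <asm-generic/pgtable-nopud.h>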
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Update code that relied on sched.h including various MM types for them.
This will allow us to remove the <linux/mm_types.h> include from <linux/sched.h>.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add support for using multiple hugepage sizes simultaneously
on mainline. For now, 256M pages are supported, and they can
be used along with 8M pages.
Page tables are set like this (e.g. for a 256M page):

    VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]

and the TSB is set similarly:

    VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]
- Testing
Tested on Sonoma (which supports 256M pages) by running stream
benchmark instances in parallel: one instance uses 8M pages and
another uses 256M pages, consuming 48G each.
Boot params used:
default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
hugepages=10000
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It really has to be pgdp, not pgd.
It just happened to work since all callers pass 'pgd' as the argument.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For PMD-aligned (8M) hugepages, we currently allocate
all four page table levels, which is wasteful. We now
allocate only down to the PMD level, which saves page
table memory.
Also, when freeing the page tables for an 8M hugepage backed
region, make sure we don't try to access the non-existent PTE level.
Orabug: 22630259
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull sparc updates from David Miller:
"Some 32-bit kgdb cleanups from Sam Ravnborg, and a hugepage TLB flush
overhead fix on 64-bit from Nitin Gupta"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Reduce TLB flushes during hugepte changes
aeroflex/greth: fix warning about unused variable
openprom: fix warning
sparc32: drop superfluous cast in calls to __nocache_pa()
sparc32: fix build with STRICT_MM_TYPECHECKS
sparc32: use proper prototype for trapbase
sparc32: drop local prototype in kgdb_32
sparc32: drop hardcoding trap_level in kgdb_trap
During hugepage map/unmap, TSB and TLB flushes are currently
issued at every PAGE_SIZE boundary, which is unnecessary.
We now issue the flush at REAL_HPAGE_SIZE boundaries only.
Without this patch, workloads which unmap a large hugepage-backed
VMA region get CPU lockups due to excessive TLB flush calls.
Orabug: 22365539, 22643230, 22995196
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've just discovered that the useful-sounding has_transparent_hugepage()
is actually an architecture-dependent minefield: on some arches it only
builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
not, but on some of those (arm and arm64) it then gives the wrong
answer; and on mips alone it's marked __init, which would crash if
called later (but so far it has not been called later).
Straighten this out: make it available to all configs, with a sensible
default in asm-generic/pgtable.h, removing its definitions from those
arches (arc, arm, arm64, sparc, tile) which are served by the default,
adding #define has_transparent_hugepage has_transparent_hugepage to
those (mips, powerpc, s390, x86) which need to override the default at
runtime, and removing the __init from mips (but maybe that kind of code
should be avoided after init: set a static variable the first time it's
called).
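A sketch of what this yields (the default follows the description above;
the runtime override example uses a hypothetical helper):

    /* asm-generic/pgtable.h: sensible default for all configs */
    #ifndef has_transparent_hugepage
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
    #define has_transparent_hugepage() 1
    #else
    #define has_transparent_hugepage() 0
    #endif
    #endif

    /* an arch that must decide at runtime: */
    static inline int has_transparent_hugepage(void)
    {
            return cpu_supports_thp();      /* hypothetical helper */
    }
    #define has_transparent_hugepage has_transparent_hugepage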
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc]
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> [arch/s390]
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Reviewed-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites
of the contents, since the MADV_FREE syscall can be called on THP pages.
This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.
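On sparc64, for instance, these helpers degenerate into the PTE
operations; a hedged sketch of that pattern:

    static inline unsigned long pmd_dirty(pmd_t pmd)
    {
            pte_t pte = __pte(pmd_val(pmd));

            return pte_dirty(pte);
    }

    static inline pmd_t pmd_mkclean(pmd_t pmd)
    {
            pte_t pte = pte_mkclean(__pte(pmd_val(pmd)));

            return __pmd(pte_val(pte));
    }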
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Evans <je@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mika Penttilä <mika.penttila@nextfour.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have confusingly named functions to clear pmds, pmd_clear_* and
pmd_clear. Add _huge_ to the pmdp_clear functions so that it is clear
they operate on hugepage ptes.
We don't bother with other functions like pmdp_set_wrprotect and
pmdp_clear_flush_young, because they operate on PTE bits and hence
already indicate they are operating on hugepage ptes.
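For example:

    pmdp_get_and_clear()  ->  pmdp_huge_get_and_clear()
    pmdp_clear_flush()    ->  pmdp_huge_clear_flush()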
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE
Bit 9 of the TTE is CV (Cacheable in V-cache) on sparc v9 processors,
while the same bit 9 is MCDE (Memory Corruption Detection Enable) on
the M7 processor. This creates a conflicting usage of the same bit.
The kernel sets the TTE.cv bit on all pages for the sun4v architecture,
which works well for sparc v9 but enables memory corruption detection
on the M7 processor, which is not the intent. This patch adds code to
determine if the kernel is running on an M7 processor and takes steps
to not enable memory corruption detection in the TTE erroneously.
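A hedged sketch of the boot-time selection (the flag variable name is an
assumption; the TTE bit defines are sparc64's):

    /* Pick the sun4v cacheability bits once the chip type is known,
     * so bit 9 (CV on v9, MCDE on M7) is never set on M7. */
    switch (sun4v_chip_type) {
    case SUN4V_CHIP_SPARC_M7:
            page_cache4v_flag = _PAGE_CP_4V;
            break;
    default:
            page_cache4v_flag = _PAGE_CP_4V | _PAGE_CV_4V;
            break;
    }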
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LKP has triggered a compiler warning after my recent patch "mm: account
pmd page tables to the process":
mm/mmap.c: In function 'exit_mmap':
>> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]
The code:
> 2857 WARN_ON(mm_nr_pmds(mm) >
2858 round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
In this, on tile, we have FIRST_USER_ADDRESS defined as 0, so
round_up() has the same type -- int -- and the right shift by
PUD_SHIFT then exceeds the width of that type.
I think the best way to fix it is to define FIRST_USER_ADDRESS as
unsigned long, on every arch for consistency.
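The per-arch change is mechanical, e.g.:

    -#define FIRST_USER_ADDRESS	0
    +#define FIRST_USER_ADDRESS	0UL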
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've replaced the remap_file_pages(2) implementation with emulation.
Nobody creates non-linear mappings anymore.
This patch also increases the number of bits available for the swap offset.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Like the small zero page, the huge zero page should not be accounted
in the smaps report as a normal page.
For small pages we rely on vm_normal_page() to filter out the zero page,
but vm_normal_page() is not designed to handle pmds. We only get here
due to a hackish cast of pmd to pte in smaps_pte_range() -- the pte and
pmd formats are not necessarily compatible on each and every architecture.
Let's add a separate codepath to handle pmds. follow_trans_huge_pmd()
will detect the huge zero page for us.
We need the pmd_dirty() helper to do this properly. The patch adds it
to the THP-enabled architectures which don't yet have one.
[akpm@linux-foundation.org: use do_div to fix 32-bit build]
Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengwei Yin <yfw.kernel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
swapper_low_pmd_dir and swapper_pud_dir are actually completely
useless and unnecessary.
We just need swapper_pg_dir[]. Naturally the other page table chunks
will be allocated on an as-needed basis. Since the kernel actually
accesses these tables in the PAGE_OFFSET view, there is not even a TLB
locality advantage of placing them in the kernel image.
Use the hard coded vmlinux.ld.S slot for swapper_pg_dir which is
naturally page aligned.
Increase MAX_BANKS to 1024 in order to handle heavily fragmented
virtual guests.
Even with this MAX_BANKS increase, the kernel is 20K+ smaller.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
In order to accommodate embedded per-cpu allocation with large numbers
of cpus and numa nodes, we have to use as much virtual address space
as possible for the vmalloc region. Otherwise we can get things like:
PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000
So, once we select a value for PAGE_OFFSET, derive the size of the
vmalloc region based upon that.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
Make sure, at compile time, that the kernel can properly support
whatever MAX_PHYS_ADDRESS_BITS is defined to.
On M7 chips, use a max_phys_bits value of 49.
Based upon a patch by Bob Picco.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
If max_phys_bits needs to be > 43 (e.g. for T4 chips), things like
DEBUG_PAGEALLOC stop working because the 3-level page tables can only
cover up to 43 bits (a 13-bit page offset plus three levels of 1024
8-byte entries: 13 + 3 * 10 = 43).
Another problem is that when we increased MAX_PHYS_ADDRESS_BITS up to
47, several statically allocated tables became enormous.
Compounding this is that we will need to support up to 49 bits of
physical addressing for M7 chips.
The two tables in question are sparc64_valid_addr_bitmap and
kpte_linear_bitmap.
The first holds a bitmap, with 1 bit for each 4MB chunk of physical
memory, indicating whether that chunk actually exists in the machine
and is valid.
The second table is a set of 2-bit values which tell how large of a
mapping (4MB, 256MB, 2GB, 16GB, respectively) we can use at each 256MB
chunk of ram in the system.
These tables are huge and take up an enormous amount of the BSS
section of the sparc64 kernel image. Specifically, the
sparc64_valid_addr_bitmap is 4MB, and the kpte_linear_bitmap is 128K.
So let's solve the space wastage and the DEBUG_PAGEALLOC problem
at the same time, by using the kernel page tables (as designed) to
manage this information.
We have to keep using large mappings when DEBUG_PAGEALLOC is disabled,
and we do this by encoding huge PMDs and PUDs.
On a T4-2 with 256GB of ram the kernel page table takes up 16K with
DEBUG_PAGEALLOC disabled and 256MB with it enabled. Furthermore, this
memory is dynamically allocated at run time rather than coded
statically into the kernel image.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
This has become necessary with chips that support more than 43-bits
of physical addressing.
Based almost entirely upon a patch by Bob Picco.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Bob Picco <bob.picco@oracle.com>
Drop extern from all prototypes and adjust the alignment of parameters
as required after the removal.
In a few rare cases adjust the line length to conform to the maximum of
80 chars, and likewise in a few rare cases adjust the alignment of
parameters to static functions.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Access to the TSB hash tables during TLB misses requires that there be
an atomic 128-bit quad load available so that we fetch a matching TAG
and DATA field at the same time.
On cpus prior to UltraSPARC-III only virtual address based quad loads
are available. UltraSPARC-III and later provide physical address
based variants which are easier to use.
When we only have virtual address based quad loads available this
means that we have to lock the TSB into the TLB at a fixed virtual
address on each cpu when it runs that process. We can't just access
the PAGE_OFFSET based aliased mapping of these TSBs because we cannot
take a recursive TLB miss inside of the TLB miss handler without
risking running out of hardware trap levels (some trap combinations
can be deep, such as those generated by register window spill and fill
traps).
Without huge pages this works perfectly fine, but when the huge TSB
was added, another chunk of fixed virtual address space was not
allocated for this second TSB mapping.
So we were mapping both the 8K and 4MB TSBs to the same exact virtual
address, causing multiple TLB matches, which gives undefined behavior.
Signed-off-by: David S. Miller <davem@davemloft.net>
pte_ERROR() is not used anywhere, so delete it.
For pgd_ERROR() and pmd_ERROR(), output something similar to x86, giving
the address of the pgd/pmd as well as its value.
Also report the caller, since these macros are invoked from pgd_clear_bad()
and pmd_clear_bad(), which provide little context as to what high-level
operation was occurring when the BAD state was detected.
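A hedged sketch of the shape such a macro takes (format string approximate):

    #define pmd_ERROR(e)                                            \
            pr_err("%s:%d: bad pmd %p(%016lx) seen at (%pS)\n",     \
                   __FILE__, __LINE__, &(e), pmd_val(e),            \
                   __builtin_return_address(0))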
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of returning false we should at least check the most basic
things, otherwise page table corruptions will be very difficult to
debug.
PMD and PTE tables are of size PAGE_SIZE, so none of the sub-PAGE_SIZE
bits should be set.
We also complement this with a check that the physical address the
pud/pmd points to is valid memory.
PowerPC was used as a guide while implementing this.
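A hedged sketch of the idea (pmd_page_paddr() is an assumed helper
returning the physical address the pmd points to):

    static inline int pmd_bad(pmd_t pmd)
    {
            unsigned long paddr = pmd_page_paddr(pmd);

            /* PTE tables are PAGE_SIZE sized and aligned, so no
             * sub-PAGE_SIZE bits may be set, and the table must
             * point at valid physical memory. */
            return (paddr & ~PAGE_MASK) != 0 ||
                   !pfn_valid(paddr >> PAGE_SHIFT);
    }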
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit b2d4383480 ("sparc64: Make
PAGE_OFFSET variable."), the MAX_PHYS_ADDRESS_BITS value was increased
(to 47).
This constant reference to '41UL' was missed.
Signed-off-by: David S. Miller <davem@davemloft.net>
The large PMD path needs to check _PAGE_VALID not _PAGE_PRESENT, to
decide if it needs to bail and return 0.
pmd_large() should therefore just check _PAGE_PMD_HUGE.
Calls to gup_huge_pmd() are guarded with a check of pmd_large(), so we
just need to add a valid bit check.
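A hedged sketch of the two checks described:

    #define pmd_large(pmd)  (pmd_val(pmd) & _PAGE_PMD_HUGE)

    /* in gup_huge_pmd(): bail unless the huge PMD is valid */
    if (!(pmd_val(pmd) & _PAGE_VALID))
            return 0;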
Signed-off-by: David S. Miller <davem@davemloft.net>
On sparc64, "present" and "valid" are separate PTE bits; this allows us to
naturally distinguish between the user explicitly asking for PROT_NONE
with mprotect() and other situations.
However we weren't handling this properly in the huge PMD paths.
First of all, the page table walker in the TSB miss path only checks
for _PAGE_PMD_HUGE. So the generic pmdp_invalidate() would clear
_PAGE_PRESENT but the TLB miss paths would still load it into the TLB
as a valid huge PMD.
Fix this by clearing the valid bit in pmdp_invalidate(), and also
checking the valid bit in USER_PGTABLE_CHECK_PMD_HUGE using "brgez"
since _PAGE_VALID is bit 63 in both the sun4u and sun4v pte layouts.
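A hedged sketch of the pmdp_invalidate() side of the fix:

    void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
                         pmd_t *pmdp)
    {
            pmd_t entry = *pmdp;

            /* Clear the valid bit, not just the present bit, so the
             * TSB miss walkers no longer load this entry. */
            pmd_val(entry) &= ~_PAGE_VALID;

            set_pmd_at(vma->vm_mm, address, pmdp, entry);
            flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
    }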
Signed-off-by: David S. Miller <davem@davemloft.net>
There are a few subtle races, between change_protection_range (used by
mprotect and change_prot_numa) on one side, and NUMA page migration and
compaction on the other side.
The basic race is that there is a time window between when the PTE gets
made non-present (PROT_NONE or NUMA), and the TLB is flushed.
During that time, a CPU may continue writing to the page.
This is fine most of the time, however compaction or the NUMA migration
code may come in, and migrate the page away.
When that happens, the CPU may continue writing, through the cached
translation, to what is no longer the current memory location of the
process.
This only affects x86, which has a somewhat optimistic pte_accessible.
All other architectures appear to be safe, and will either always flush,
or flush whenever there is a valid mapping, even with no permissions
(SPARC).
The basic race looks like this:
CPU A                   CPU B                   CPU C

                                                load TLB entry
make entry PTE/PMD_NUMA
                        fault on entry
                                                read/write old page
                        start migrating page
                        change PTE/PMD to new page
                                                read/write old page [*]
flush TLB
                                                reload TLB from new entry
                                                read/write new page
                                                lose data
[*] the old page may belong to a new user at this point!
The obvious fix is to flush remote TLB entries, by making pte_accessible
aware of the fact that PROT_NONE and PROT_NUMA memory may still be
accessible if there is a TLB flush pending for the mm.
This should fix both NUMA migration and compaction.
[mgorman@suse.de: fix build]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that we have 64-bits for PMDs we can stop using special encodings
for the huge PMD values, and just put real PTEs in there.
We allocate a _PAGE_PMD_HUGE bit to distinguish between plain PMDs and
huge ones. It is the same for both 4U and 4V PTE layouts.
We also use _PAGE_SPECIAL to indicate the splitting state, since a
huge PMD cannot also be special.
All of the PMD --> PTE translation code disappears, and most of the
huge PMD bit modifications and tests just degenerate into the PTE
operations. In particular USER_PGTABLE_CHECK_PMD_HUGE becomes
trivial.
As a side effect, normal PMDs don't shift the physical address around.
This also speeds up the page table walks in the TLB miss paths since
they don't have to do the shifts any more.
Another non-trivial aspect is that pte_modify() has to be changed
to preserve the _PAGE_PMD_HUGE bits as well as the page size field
of the pte.
Signed-off-by: David S. Miller <davem@davemloft.net>
To make the page tables compact, we were using 32-bit PGDs and PMDs.
We only had to support <= 43 bits of physical addresses so this was
quite feasible.
In order to support larger physical addresses we have to move to
64-bit PGDs and PMDs.
Most of the changes are straightforward:
1) {pgd,pmd}_t --> unsigned long
2) Anything that tries to use plain "unsigned int" types with pgd/pmd
values needs to be adjusted. In particular things like "0U" become
"0UL".
3) {PGDIR,PMD}_BITS decrease by one.
4) In the assembler page table walkers, use "ldxa" instead of "lduwa"
and adjust the low bit masks to clear out the low 3 bits instead of
just the low 2 bits during pgd/pmd address formation.
Also, use PTRS_PER_PGD and PTRS_PER_PMD in the sizing of the
swapper_{pg_dir,low_pmd_dir} arrays.
This patch does not try to take advantage of having 64-bits in the
PMDs to simplify the hugepage code, that will come in a subsequent
change.
Signed-off-by: David S. Miller <davem@davemloft.net>
The impetus for this is that we would like to move to 64-bit PMDs and
PGDs, but that would result in only supporting a 42-bit address space
with the current page table layout. It'd be nice to support at least
43-bits.
The reason we'd end up with only 42-bits after making PMDs and PGDs
64-bit is that we only use half-page sized PTE tables in order to make
PMDs line up to 4MB, the hardware huge page size we use.
So what we do here is we make huge pages 8MB, and fabricate them using
4MB hw TLB entries.
Facilitate this by providing a "REAL_HPAGE_SHIFT" which is used in
places that really need to operate on hardware 4MB pages.
Use full pages (512 entries) for PTE tables, and adjust PMD_SHIFT,
PGD_SHIFT, and the build time CPP test as needed. Use a CPP test to
make sure REAL_HPAGE_SHIFT and the _PAGE_SZHUGE_* we use match up.
This makes the pgtable cache completely unused, so remove the code
managing it and the state used in mm_context_t. Now we take fewer
spinlocks in the page table allocation path.
The technique we use to fabricate the 8MB pages is to transfer bit 22
from the missing virtual address into the PTE's physical address field.
That takes care of the transparent huge pages case.
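Conceptually (the real transfer happens in the TSB/TLB miss assembler,
so treat this C as illustration only, with a hypothetical function name):

    static unsigned long fabricate_4mb_tte(pte_t pte, unsigned long vaddr)
    {
            /* Fold VA bit 22 into the PA so an 8MB huge mapping
             * resolves to the correct one of its two 4MB hw entries. */
            return pte_val(pte) | (vaddr & (1UL << 22));
    }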
For hugetlb, we fill things in at the PTE level and that code already
puts the sub huge page physical bits into the PTEs, based upon the
offset, so there is nothing special we need to do. It all just works
out.
So, a small amount of complexity in the THP case, but this code is
about to get much simpler when we move to 64-bit PMDs, as we can move
away from the fancy 32-bit huge PMD encoding and just put a real PTE
value in there.
With bug fixes and help from Bob Picco.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull powerpc updates from Ben Herrenschmidt:
"This is the powerpc changes for the 3.11 merge window. In addition to
the usual bug fixes and small updates, the main highlights are:
- Support for transparent huge pages by Aneesh Kumar for 64-bit
server processors. This allows the use of 16M pages as transparent
huge pages on kernels compiled with a 64K base page size.
- Base VFIO support for KVM on power by Alexey Kardashevskiy
- Wiring up of our nvram to the pstore infrastructure, including
putting compressed oopses in there by Aruna Balakrishnaiah
- Move, rework and improve our "EEH" (basically PCI error handling
and recovery) infrastructure. It is no longer specific to pseries
but is now usable by the new "powernv" platform as well (no
hypervisor) by Gavin Shan.
- I fixed some bugs in our math-emu instruction decoding and made it
usable to emulate some optional FP instructions on processors with
hard FP that lack them (such as fsqrt on Freescale embedded
processors).
- Support for Power8 "Event Based Branch" facility by Michael
Ellerman. This facility allows what is basically "userspace
interrupts" for performance monitor events.
- A bunch of Transactional Memory vs. Signals bug fixes and HW
breakpoint/watchpoint fixes by Michael Neuling.
And more ... I apologize in advance if I've failed to highlight
something that somebody deemed worth it."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
pstore: Add hsize argument in write_buf call of pstore_ftrace_call
powerpc/fsl: add MPIC timer wakeup support
powerpc/mpic: create mpic subsystem object
powerpc/mpic: add global timer support
powerpc/mpic: add irq_set_wake support
powerpc/85xx: enable coreint for all the 64bit boards
powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
powerpc/mpic: Add get_version API both for internal and external use
powerpc: Handle both new style and old style reserve maps
powerpc/hw_brk: Fix off by one error when validating DAWR region end
powerpc/pseries: Support compression of oops text via pstore
powerpc/pseries: Re-organise the oops compression code
pstore: Pass header size in the pstore write callback
powerpc/powernv: Fix iommu initialization again
powerpc/pseries: Inform the hypervisor we are using EBB regs
powerpc/perf: Add power8 EBB support
powerpc/perf: Core EBB support for 64-bit book3s
powerpc/perf: Drop MMCRA from thread_struct
powerpc/perf: Don't enable if we have zero events
...
This will later be used by powerpc THP support. In powerpc we want to use
the pgtable for storing the hash index values, so instead of adding them
to the mm_context list, we would like to store them in the second half
of the pmd.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As reported by Dave Kleikamp, when we emit cross calls to do batched
TLB flush processing we have a race because we do not synchronize on
the sibling cpus completing the cross call.
So meanwhile the TLB batch can be reset (tb->tlb_nr set to zero, etc.)
and either flushes are missed or flushes will flush the wrong
addresses.
Fix this by using generic infrastructure to synchronize on the
completion of the cross call.
This first required getting the flush_tlb_pending() call out from
switch_to() which operates with locks held and interrupts disabled.
The problem is that smp_call_function_many() cannot be invoked with
IRQs disabled and this is explicitly checked for with WARN_ON_ONCE().
We get the batch processing outside of locked IRQ disabled sections by
using some ideas from the powerpc port. Namely, we only batch inside
of arch_{enter,leave}_lazy_mmu_mode() calls. If we're not in such a
region, we flush TLBs synchronously.
1) Get rid of xcall_flush_tlb_pending and per-cpu type
implementations.
2) Do TLB batch cross calls instead via:
smp_call_function_many()
tlb_pending_func()
__flush_tlb_pending()
3) Batch only in lazy mmu sequences (sketched after this list):
a) Add 'active' member to struct tlb_batch
b) Define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
c) Set 'active' in arch_enter_lazy_mmu_mode()
d) Run batch and clear 'active' in arch_leave_lazy_mmu_mode()
e) Check 'active' in tlb_batch_add_one() and do a synchronous
flush if it's clear.
4) Add infrastructure for synchronous TLB page flushes.
a) Implement __flush_tlb_page and per-cpu variants, patch
as needed.
b) Likewise for xcall_flush_tlb_page.
c) Implement smp_flush_tlb_page() to invoke the cross-call.
d) Wire up global_flush_tlb_page() to the right routine based
upon CONFIG_SMP
5) It turns out that singleton batches are very common, 2 out of every
3 batch flushes have only a single entry in them.
The batch flush waiting is very expensive, both because of the poll
on sibling cpu completion, and because passing the tlb batch
pointer to the sibling cpus invokes a shared memory dereference.
Therefore, in flush_tlb_pending(), if there is only one entry in
the batch perform a completely asynchronous global_flush_tlb_page()
instead.
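A hedged sketch of the batching discipline in (3) (the 'active' member
and helpers follow the description above; details are illustrative):

    void arch_enter_lazy_mmu_mode(void)
    {
            struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);

            tb->active = 1;
    }

    void arch_leave_lazy_mmu_mode(void)
    {
            struct tlb_batch *tb = this_cpu_ptr(&tlb_batch);

            if (tb->tlb_nr)
                    flush_tlb_pending();
            tb->active = 0;
    }

    /* and in tlb_batch_add_one(): flush synchronously when not batching */
    if (!tb->active) {
            global_flush_tlb_page(mm, vaddr);
            goto out;
    }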
Reported-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Mostly mirrors the s390 logic, as unlike x86 we don't need the
SetPageReferenced() bits.
On sparc64 we also lack a user/privileged bit in the huge PMDs.
In order to make this work for THP and non-THP builds, some header
file adjustments were necessary. Namely, provide the PMD_HUGE_* bit
defines and the pmd_large() inline unconditionally rather than
protected by TRANSPARENT_HUGEPAGE.
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can elide flush_tlb_*() calls when _PAGE_VALID is clear
as that is the test used to determine whether or not to
queue up a TLB flush in set_pte_at().
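Concretely, the queueing test in set_pte_at() looks roughly like this
(a sketch):

    /* Only a previously-valid PTE can be cached in any TLB, so only
     * then does a flush need to be queued. */
    if (likely(mm != &init_mm) && (pte_val(orig) & _PAGE_VALID))
            tlb_batch_add(mm, addr, ptep, orig, fullmm);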
Signed-off-by: David S. Miller <davem@davemloft.net>
This is relatively easy since PMDs now cover exactly 4MB of memory.
Our PMD entries are 32 bits each, so we use a special encoding. The
lowest bit, PMD_ISHUGE, determines the interpretation. This is possible
because sparc64's page tables are purely software entities, so we can use
whatever encoding scheme we want. We just have to make the TLB miss
assembler page table walkers aware of the layout.
set_pmd_at() works much like set_pte_at(), but it has to operate in two
regimes. In the first regime we are going from a table of non-huge PTEs
to a huge page, so we have to queue up TLB flushes based upon what
mappings are valid in the PTE table. In the second regime we are going
from huge-page to non-huge-page, and in that case we need only queue up
a single TLB flush to push out the huge page mapping.
We still have 5 bits remaining in the huge PMD encoding so we can very
likely support any new pieces of THP state tracking that might get added
in the future.
With lots of help from Johannes Weiner.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're going to be messing around with the PMD interpretation and layout
for the sake of transparent huge pages, so we better clearly document what
we're starting with.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The reason we want to do this is to facilitate transparent huge page
support.
Right now PMDs cover 8MB of address space, and our huge page size is 4MB.
The current transparent hugepage support is not able to handle HPAGE_SIZE
!= PMD_SIZE.
So make PTE tables be sized to half of a page instead of a full page.
We can still properly map the whole supported virtual address range,
which on sparc64 requires 44 bits. Add a compile-time CPP test which
ensures that this requirement is always met.
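The compile-time test looks roughly like this (hedged; the constant is
the 44-bit requirement discussed above):

    #if (PGDIR_SHIFT + PGDIR_BITS) < 44
    #error Page table parameters do not cover virtual address space properly.
    #endif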
There is a minor inefficiency added by this change. We only use half of
the page for PTE tables. It's not trivial to use only half of the page
yet still get all of the pgtable_page_{ctor,dtor}() stuff working
properly. It is doable, and that will come in a subsequent change.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Narrowing the scope of the page size configurations will make the
transparent hugepage changes much simpler.
In the end what we really want to do is have the kernel support multiple
huge page sizes and use whatever is appropriate as the context dictates.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>