The iterate_irefs in backref.c is used to build path components from inode
refs. This patch adds code to iterate extended refs as well.
I had modify the callback function signature to abstract out some of the
differences between ref structures. iref_to_path() also needed similar
changes.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
This patch adds basic support for extended inode refs. This includes support
for link and unlink of the refs, which basically gets us support for rename
as well.
Inode creation does not need changing - extended refs are only added after
the ref array is full.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAUHPmWxOxKuMESys7AQKOvw//XnLQRzin9nWB91Dx9G5ZAZ3f1t5YFI31
mocFdgeuP1+qg1w4/dED1xnqODh9Sbi6gBsFa3wIAIcK/qkqwu8TCbFQDkjZsJkZ
j/ZghiyrrMuFn71m2+cWpAcyoEKdtmfsd352lXOW6ACP21Yth/64GsJJpdM7ywVo
K0mqtXcA0GDGioF9bc6p/fk/fS1V4X5dYle1hx9Pvsk1qxFrYNpkvBLkMxccn0Sf
SUNLO1p4NIlfSjyO/7A5FwDGGkP4RzDAZ/Sno9+4tBqZ3wyTftfBRiocnK4VTfAW
NhDWinIHwa+uAXK5A0wusPHHpmrVv7Caqda2pkkNmU8MtbB8NRJsMhM+xG+xZN8C
UAESm69Ey2xbR8QNG7HQadCXywcIlDGGvvXbUgQA0q+WYL23gkUb5JtoEsWn39ce
+m89dCzxz1QpQ0m4uTrCwR7cgs8URfozRFp9UO5yGHX3tNc6oiyeZl2wt/UUm8TJ
oE2xbUuHB+OGxFw8FalJ92mM9cZcfKJxXSSZpUsb8CEIHNEvtNleG4KohynCwowQ
IeRBZaGpDWttgNAia+suK1cunJ7Idvqx0T/aWGDBxAjxrJWLqmh9rMwVigmVf4RP
2TCrQW0cwfEMYvBLwszcfutrbzx/yfLhX+hhP9MTyroHzb6u1oyR1mh3uB4WXLKE
BnMyXQjQOOE=
=sTs2
-----END PGP SIGNATURE-----
Merge tag 'disintegrate-s390-20121009' of
git://git.infradead.org/users/dhowells/linux-headers
Pull UAPI patchset from David Howells:
"Can you merge the following branch into the s390 tree please.
This is to complete part of the UAPI disintegration for which the
preparatory patches were pulled recently."
Conflicts:
arch/s390/include/asm/chpid.h
The load of the svc number in the TIF_SYSCALL restart path needs to be
done with an instruction that loads all 64 bits of %r1, 'lh' only loads
32 bits. If the upper half of %r1 is not zero and has the msb set,
entry64.S will try to execute an svc with a really large number.
What will be in the upper half of %r1 depends on the code generated by
gcc for the functions on the do_signal() callchain.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
vmem_add_mem() should only then insert a large page if pmd_none() is true
for the specific entry. We might have a leftover from a previous mapping.
In addition make vmem_remove_range()'s page table walk code more complete
and fix a couple of potential endless loops (which can never happen :).
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add a special module area on top of the vmalloc area, which may be only
used for modules and bpf jit generated code.
This makes sure that inter module branches will always happen without a
trampoline and in addition having all the code within a 2GB frame is
branch prediction unit friendly.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove duplicated include.
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
No need for an ifdef __KERNEL__ since css_chars.h is not exported.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make sure that exported headers are save to be included by userspace
exploiting /dev/chsc.
Reported-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Let the kernel text section always begin at 1MB. This allows to always have
a large frame in the identity mapping of the kernel image for beginning of
the text section, if the machine has EDAT1 support.
Moving the beginning from 64K to 1MB doesn't cost any memory, since we make
the memory between 64K and 1MB available for the page allocator.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Within the identity mapping the kernel text section is mapped read-only.
However when mapping the first and last page of the text section we must
round upwards and downwards respectively, if only parts of a page belong
to the section.
Otherwise potential rw data can be mapped read-only. So the rounding must
be done just the other way we have it right now.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
This is more or less the same as the x86 page table dumper which was
merged four years ago: 926e5392 "x86: add code to dump the (kernel)
page tables for visual inspection by kernel developers".
We add a file at /sys/kernel/debug/kernel_page_tables for debugging
purposes so it's quite easy to see the kernel page table layout and
possible odd mappings:
---[ Identity Mapping ]---
0x0000000000000000-0x0000000000100000 1M PTE RW
---[ Kernel Image Start ]---
0x0000000000100000-0x0000000000800000 7M PMD RO
0x0000000000800000-0x00000000008a9000 676K PTE RO
0x00000000008a9000-0x0000000000900000 348K PTE RW
0x0000000000900000-0x0000000001500000 12M PMD RW
---[ Kernel Image End ]---
0x0000000001500000-0x0000000280000000 10219M PMD RW
0x0000000280000000-0x000003d280000000 3904G PUD I
---[ vmemmap Area ]---
0x000003d280000000-0x000003d288c00000 140M PTE RW
0x000003d288c00000-0x000003d300000000 1908M PMD I
0x000003d300000000-0x000003e000000000 52G PUD I
---[ vmalloc Area ]---
0x000003e000000000-0x000003e000009000 36K PTE RW
0x000003e000009000-0x000003e0000ee000 916K PTE I
0x000003e0000ee000-0x000003e000146000 352K PTE RW
0x000003e000146000-0x000003e000200000 744K PTE I
0x000003e000200000-0x000003e080000000 2046M PMD I
0x000003e080000000-0x0000040000000000 126G PUD I
This usually makes only sense for kernel developers. The output
with CONFIG_DEBUG_PAGEALLOC is not very helpful, because of the
huge number of mapped out pages, however I decided for the time
being to not add a !DEBUG_PAGEALLOC dependency.
Maybe it's helpful for somebody even with that option.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Do the switch to z/Architecture (alias 64 bit) mode early in head.S.
If the machine is already running in 64 bit mode the sigp turns into
a nop. With this change it doesn't matter in which mode the kernel
is started.
Reviewd-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove some EXPORT_SYMBOLs. There is no module code anywhere
which calls these pageattr functions.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The current page table walk code in pageattr.c only checks for large pages
while walking the kernel page table, but happily assumes that everything
else is just fine.
Add more checks so we never access invalid memory regions.
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
pmd_huge() will always return 0 on !HUGETLBFS, however we use that helper
function when walking the kernel page tables to decide if we have a
1MB page frame or not.
Since we create 1MB frames for the kernel 1:1 mapping independently of
HUGETLBFS this can lead to incorrect storage accesses since the code
can assume that we have a pointer to a page table instead of a pointer
to a 1MB frame.
Fix this by adding a pmd_large() primitive like other architectures have
it already and remove all references to HUGETLBFS/HUGETLBPAGE from the
code that walks kernel page tables.
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Let the driver core handle device attribute creation and removal. This
will simplify the code and eliminates races between attribute
availability and userspace notification via uevents.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make use of the pfmf instruction, if available, to initialize storage keys
of whole 1MB or 2GB frames instead of initializing every single page with
the sske instruction.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
MACHINE_HAS_PFMF and MACHINE_HAS_HPAGE actually have the same semantics:
the cpu has the EDAT1 facility installed in zArch mode. So remove one
of the feature flags in machine_flags and rename the other one to EDAT1.
The two old macros simply get mapped to MACHINE_HAS_EDAT1.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The VM_RESERVED flag was killed off in commit 314e51b985 ("mm: kill
vma flag VM_RESERVED and mm->reserved_vm counter"), and replaced by the
proper semantic flags (eg "don't core-dump" etc). But there was a new
use of VM_RESERVED that got missed by the merge.
Fix the remaining use of VM_RESERVED in the vfio_pci driver, replacing
the VM_RESERVED flag with VM_DONTEXPAND | VM_DONTDUMP.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation,org>
i218 is the next-generation LOM that will be available on systems with the
Lynx Point LP Platform Controller Hub (PCH) chipset from Intel. This patch
provides the initial support of those devices.
Signed-off-by: Bruce Allan <bruce.w.allan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This change limits the PF/VF driver to 9.5K max jumbo frame size in order
prevent a possible Tx hang in the adapter when sending frames between
pools.
All of the parts in ixgbe support a maximum frame of 15.5K for standard
traffic, however with SR-IOV or DCB enabled they should be limiting the
MTU size to 9.5K. Instead of adding extra checks which would have to
change the MTU when we go into or out of these modes it is preferred to
just use a standard 9.5K MTU limit for all modes so that this extra
overhead can be avoided.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The driver was not setting the number of real Tx queues in the net_device
structure. This caused some serious issues such as Tx hangs and extremely
poor performance with some usages of the driver.
The issue is best observed by running:
iperf -c <host> -P <n>
Where n is greater than one. The greater the value of n the more likely
the problem is to show up.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
Merge patches from Andrew Morton:
"A few misc things and very nearly all of the MM tree. A tremendous
amount of stuff (again), including a significant rbtree library
rework."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (160 commits)
sparc64: Support transparent huge pages.
mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd().
mm: Add and use update_mmu_cache_pmd() in transparent huge page code.
sparc64: Document PGD and PMD layout.
sparc64: Eliminate PTE table memory wastage.
sparc64: Halve the size of PTE tables
sparc64: Only support 4MB huge pages and 8KB base pages.
memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
mm: memcg: clean up mm_match_cgroup() signature
mm: document PageHuge somewhat
mm: use %pK for /proc/vmallocinfo
mm, thp: fix mlock statistics
mm, thp: fix mapped pages avoiding unevictable list on mlock
memory-hotplug: update memory block's state and notify userspace
memory-hotplug: preparation to notify memory block's state at memory hot remove
mm: avoid section mismatch warning for memblock_type_name
make GFP_NOTRACK definition unconditional
cma: decrease cc.nr_migratepages after reclaiming pagelist
CMA: migrate mlocked pages
kpageflags: fix wrong KPF_THP on non-huge compound pages
...
This is relatively easy since PMD's now cover exactly 4MB of memory.
Our PMD entries are 32-bits each, so we use a special encoding. The
lowest bit, PMD_ISHUGE, determines the interpretation. This is possible
because sparc64's page tables are purely software entities so we can use
whatever encoding scheme we want. We just have to make the TLB miss
assembler page table walkers aware of the layout.
set_pmd_at() works much like set_pte_at() but it has to operate in two
page from a table of non-huge PTEs, so we have to queue up TLB flushes
based upon what mappings are valid in the PTE table. In the second regime
we are going from huge-page to non-huge-page, and in that case we need
only queue up a single TLB flush to push out the huge page mapping.
We still have 5 bits remaining in the huge PMD encoding so we can very
likely support any new pieces of THP state tracking that might get added
in the future.
With lots of help from Johannes Weiner.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Invalidation sequences are handled in various ways on various
architectures.
One way, which sparc64 uses, is to let the set_*_at() functions accumulate
pending flushes into a per-cpu array. Then the flush_tlb_range() et al.
calls process the pending TLB flushes.
In this regime, the __tlb_remove_*tlb_entry() implementations are
essentially NOPs.
The canonical PTE zap in mm/memory.c is:
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
tlb_remove_tlb_entry(tlb, pte, addr);
With a subsequent tlb_flush_mmu() if needed.
Mirror this in the THP PMD zapping using:
orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
page = pmd_page(orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
And we properly accomodate TLB flush mechanims like the one described
above.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The transparent huge page code passes a PMD pointer in as the third
argument of update_mmu_cache(), which expects a PTE pointer.
This never got noticed because X86 implements update_mmu_cache() as a
macro and thus we don't get any type checking, and X86 is the only
architecture which supports transparent huge pages currently.
Before other architectures can support transparent huge pages properly we
need to add a new interface which will take a PMD pointer as the third
argument rather than a PTE pointer.
[akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390]
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We're going to be messing around with the PMD interpretation and layout
for the sake of transparent huge pages, so we better clearly document what
we're starting with.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We've split up the PTE tables so that they take up half a page instead of
a full page. This is in order to facilitate transparent huge page
support, which works much better if our PMDs cover 4MB instead of 8MB.
What we do is have a one-behind cache for PTE table allocations in the
mm struct.
This logic triggers only on allocations. For example, we don't try to
keep track of free'd up page table blocks in the style that the s390 port
does.
There were only two slightly annoying aspects to this change:
1) Changing pgtable_t to be a "pte_t *". There's all of this special
logic in the TLB free paths that needed adjustments, as did the
PMD populate interfaces.
2) init_new_context() needs to zap the pointer, since the mm struct
just gets copied from the parent on fork.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The reason we want to do this is to facilitate transparent huge page
support.
Right now PMD's cover 8MB of address space, and our huge page size is 4MB.
The current transparent hugepage support is not able to handle HPAGE_SIZE
!= PMD_SIZE.
So make PTE tables be sized to half of a page instead of a full page.
We can still map properly the whole supported virtual address range which
on sparc64 requires 44 bits. Add a compile time CPP test which ensures
that this requirement is always met.
There is a minor inefficiency added by this change. We only use half of
the page for PTE tables. It's not trivial to use only half of the page
yet still get all of the pgtable_page_{ctor,dtor}() stuff working
properly. It is doable, and that will come in a subsequent change.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Narrowing the scope of the page size configurations will make the
transparent hugepage changes much simpler.
In the end what we really want to do is have the kernel support multiple
huge page sizes and use whatever is appropriate as the context dictactes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When our x86 box calls __remove_pages(), release_mem_region() shows many
warnings. And x86 box cannot unregister iomem_resource.
"Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"
release_mem_region() has been changed to be called in each
PAGES_PER_SECTION by commit de7f0cba96 ("memory hotplug: release
memory regions in PAGES_PER_SECTION chunks"). Because powerpc registers
iomem_resource in each PAGES_PER_SECTION chunk. But when I hot add
memory on x86 box, iomem_resource is register in each _CRS not
PAGES_PER_SECTION chunk. So x86 box unregisters iomem_resource.
The patch fixes the problem.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It really should return a boolean for match/no match. And since it takes
a memcg, not a cgroup, fix that parameter name as well.
[akpm@linux-foundation.org: mm_match_cgroup() is not a macro]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel
virtual addresses in /proc/vmallocinfo too.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Brad Spengler <spender@grsecurity.net>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a transparent hugepage is mapped and it is included in an mlock()
range, follow_page() incorrectly avoids setting the page's mlock bit and
moving it to the unevictable lru.
This is evident if you try to mlock(), munlock(), and then mlock() a
range again. Currently:
#define MAP_SIZE (4 << 30) /* 4GB */
void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
mlock(ptr, MAP_SIZE);
$ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
Inactive(anon): 6304 kB
Unevictable: 4213924 kB
munlock(ptr, MAP_SIZE);
Inactive(anon): 4186252 kB
Unevictable: 19652 kB
mlock(ptr, MAP_SIZE);
Inactive(anon): 4198556 kB
Unevictable: 21684 kB
Notice that less than 2MB was added to the unevictable list; this is
because these pages in the range are not transparent hugepages since the
4GB range was allocated with mmap() and has no specific alignment. If
posix_memalign() were used instead, unevictable would not have grown at
all on the second mlock().
The fix is to call mlock_vma_page() so that the mlock bit is set and the
page is added to the unevictable list. With this patch:
mlock(ptr, MAP_SIZE);
Inactive(anon): 4056 kB
Unevictable: 4213940 kB
munlock(ptr, MAP_SIZE);
Inactive(anon): 4198268 kB
Unevictable: 19636 kB
mlock(ptr, MAP_SIZE);
Inactive(anon): 4008 kB
Unevictable: 4213940 kB
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>