This patch switch the ppc arch to use the generic RCU based
gup implementation.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PAGE_FACTOR was defined to reflect the difference between configured
page size and fixed 4KB page size. Replace (PAGE_SHIFT - HW_PAGE_SHIFT)
with PAGE_FACTOR.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We did part of sparse initialisation in setup_arch and part in
initmem_init. Put them together.
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Lots of places included bootmem.h even when not using bootmem.
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now bootmem is gone from powerpc we can remove comments mentioning it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment we transition from the memblock alloctor to the bootmem
allocator. Gitting rid of the bootmem allocator removes a bunch of
complicated code (most of which I owe the dubious honour of being
responsible for writing).
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
8xx sometimes need to load a invalid/non-present TLBs in
it DTLB asm handler.
These must be invalidated separaly as linux mm doesn't.
Commit 5efab4a02c was invalidating them in
arch/powerpc/mm/fault.c.
This patch does the invalidation earlier in order to free the TLB as soon as
possible. This also has the advantage of removing some 8xx specific code from
fault.c
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Scott Wood <scottwood@freescale.com>
We have an extra level of indirection on memory hot remove which is not
matched on memory hot add. Memory hotplug is book3s only, so there is
no need for it.
This also enables means remove_memory() (ie memory hot unplug) works
on powernv.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.
V2->V2
- Fix up to work against 3.18-rc1
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.
Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.
__get_cpu_var() is defined as :
__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.
this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.
This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and less registers
are used when code is generated.
At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.
The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.
Transformations done to __get_cpu_var()
1. Determine the address of the percpu instance of the current processor.
DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(&y);
2. Same as #1 but this time an array structure is involved.
DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);
Converts to
int *x = this_cpu_ptr(y);
3. Retrieve the content of the current processors instance of a per cpu
variable.
DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)
Converts to
int x = __this_cpu_read(y);
4. Retrieve the content of a percpu struct
DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);
Converts to
memcpy(&x, this_cpu_ptr(&y), sizeof(x));
5. Assignment to a per cpu variable
DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;
Converts to
__this_cpu_write(y, x);
6. Increment/Decrement etc of a per cpu variable
DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++
Converts to
__this_cpu_inc(y)
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
[mpe: Fix build errors caused by set/or_softirq_pending(), and rework
assignment in __set_breakpoint() to use memcpy().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add __init to MMU_setup() which uses __initdata boot_command_line.
Also MMU_setup() is only called from MMU_init(), which is also __init.
Warning appeared since commit 3e47d1474c.
Fixes: 3e47d1474c ("powerpc: Remove powerpc specific cmd_line")
Signed-off-by: Fabian Frederick <fabf@skynet.be>
[mpe: Update changelog]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We received a report of warning in kernel/sched/core.c where the sched
group was NULL on an LPAR after a topology update. This seems to occur
because after the topology update has moved the CPUs, cpu_to_node is
returning the old value still, which ends up breaking the consistency of
the NUMA topology in the per-cpu maps. Ensure that we update the per-cpu
fields when we re-map CPUs.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There isn't any need to keep referring to update->cpu, as we've already
checked cpu == update->cpu at this point.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch makes copro_calculate_slb() mask the ESID by the correct mask
for 1T vs 256M segments.
This has no effect by itself as the extra bits were ignored, but it
makes debugging the segment table entries easier and means that we can
directly compare the ESID values for duplicates without needing to worry
about masking in the comparison.
This will be used to simplify a comparison in the following patch.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
arch/powerpc/mm/slice.c:704:5: error: expected identifier or ‘(’ before numeric constant
int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
^
make[1]: *** [arch/powerpc/mm/slice.o] Error 1
make: *** [arch/powerpc/mm/slice.o] Error 2
This got introduced via 1217d34b53
"powerpc: Ensure global functions include their prototype". We
started including linux/hugetlb.h with that patch and now we have
#define is_hugepage_only_range(mm, addr, len) 0
with hugetlbfs disabled.
Fixes: 1217d34b53 ("powerpc: Ensure global functions include their prototype")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull more powerpc updates from Michael Ellerman:
"Here's some more updates for powerpc for 3.18.
They are a bit late I know, though must are actually bug fixes. In my
defence I nearly cut the top of my finger off last weekend in a
gruesome bike maintenance accident, so I spent a good part of the week
waiting around for doctors. True story, I can send photos if you like :)
Probably the most interesting fix is the sys_call_table one, which
enables syscall tracing for powerpc. There's a fix for HMI handling
for old firmware, more endian fixes for firmware interfaces, more EEH
fixes, Anton fixed our routine that gets the current stack pointer,
and a few other misc bits"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (22 commits)
powerpc: Only do dynamic DMA zone limits on platforms that need it
powerpc: sync pseries_le_defconfig with pseries_defconfig
powerpc: Add printk levels to setup_system output
powerpc/vphn: NUMA node code expects big-endian
powerpc/msi: Use WARN_ON() in msi bitmap selftests
powerpc/msi: Fix the msi bitmap alignment tests
powerpc/eeh: Block CFG upon frozen Shiner adapter
powerpc/eeh: Don't collect logs on PE with blocked config space
powerpc/eeh: Block PCI config access upon frozen PE
powerpc/pseries: Drop config requests in EEH accessors
powerpc/powernv: Drop config requests in EEH accessors
powerpc/eeh: Rename flag EEH_PE_RESET to EEH_PE_CFG_BLOCKED
powerpc/eeh: Fix condition for isolated state
powerpc/pseries: Make CPU hotplug path endian safe
powerpc/pseries: Use dump_stack instead of show_stack
powerpc: Rename __get_SP() to current_stack_pointer()
powerpc: Reimplement __get_SP() as a function not a define
powerpc/numa: Add ability to disable and debug topology updates
powerpc/numa: check error return from proc_create
powerpc/powernv: Fallback to old HMI handling behavior for old firmware
...
The associativity domain numbers are obtained from the hypervisor through
registers and written into memory by the guest: the packed array passed to
vphn_unpack_associativity() is then native-endian, unlike what was assumed
in the following commit:
commit b08a2a12e4
Author: Alistair Popple <alistair@popple.id.au>
Date: Wed Aug 7 02:01:44 2013 +1000
powerpc: Make NUMA device node code endian safe
This issue fills the topology with bogus data and makes it unusable. It may
lead to severe performance breakdowns.
We should ideally patch the vphn_unpack_associativity() function to do the
64-bit loads, but this requires some more brain storming.
In the meantime, let's go for a suboptimal and temporary bug fix: this patch
converts each 64-bit value of the packed array to big endian, as expected by
the current parsing code in vphn_unpack_associativity().
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle were:
- Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
Hansen)
- Various sched/idle refinements for better idle handling (Nicolas
Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)
- sched/numa updates and optimizations (Rik van Riel)
- sysbench speedup (Vincent Guittot)
- capacity calculation cleanups/refactoring (Vincent Guittot)
- Various cleanups to thread group iteration (Oleg Nesterov)
- Double-rq-lock removal optimization and various refactorings
(Kirill Tkhai)
- various sched/deadline fixes
... and lots of other changes"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
sched/fair: Delete resched_cpu() from idle_balance()
sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
sched: Improve sysbench performance by fixing spurious active migration
sched/x86: Fix up typo in topology detection
x86, sched: Add new topology for multi-NUMA-node CPUs
sched/rt: Use resched_curr() in task_tick_rt()
sched: Use rq->rd in sched_setaffinity() under RCU read lock
sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
sched: Use dl_bw_of() under RCU read lock
sched/fair: Remove duplicate code from can_migrate_task()
sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
sched: print_rq(): Don't use tasklist_lock
sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
sched: Fix the task-group check in tg_has_rt_tasks()
sched/fair: Leverage the idle state info when choosing the "idlest" cpu
sched: Let the scheduler see CPU idle states
sched/deadline: Fix inter- exclusive cpusets migrations
sched/deadline: Clear dl_entity params when setscheduling to different class
sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
...
We have hit a few customer issues with the topology update code (VPHN
and PRRN). It would be nice to be able to debug the notifications coming
from the hypervisor in both cases to the LPAR, as well as to disable
responding to the notifications at boot-time, to narrow down the source
of the problems. Add a basic level of such functionality, similar to the
numa= command-line parameter. We already have a toggle in
/proc/powerpc/topology_updates that allows run-time enabling/disabling,
so the updates can be started at run-time if desired. But the bugs we've
run into have occured during boot or very shortly after coming to login,
and have resulted in a broken NUMA topology.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
proc_create can fail, we should check the return value and pass up the
failure.
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds hooks into the core powerpc mm code for cxl.
The core powerpc code sometimes uses local tlbie. Unfortunately this won't
work with the current cxl driver as it relies on snooping tlbie broadcasts.
The cxl hardware can have TLB entries invalidated via MMIO but this is not
currently supported by the driver. In future we can make local tlbie smarter so
that it invalidates cxl contexts via MMIO when it needs to but for now we have
this workaround.
This workaround checks for any active cxl contexts and if so, disables local
tlbie.
This also adds a hook for when SLBs are invalidated. This ensures any
corresponding SLBs in cxl are also invalidated at the same time. This is
required for segment demotion.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a new function hash_page_mm() based on the existing hash_page().
This version allows any struct mm to be passed in, rather than assuming
current. This is useful for servicing co-processor faults which are not in the
context of the current running process.
We need to be careful here as the current hash_page() assumes current in a few
places.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Export mmu_kernel_ssize and mmu_linear_psize. These are needed by the cxl
driver which has it's own MMU. To setup the MMU cxl needs access to these.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves spu_flush_all_slbs() into a generic call copro_flush_all_slbs().
This will be useful when we add cxl which also needs a similar SLB flush call.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
__spu_trap_data_seg() currently contains code to determine the VSID and ESID
required for a particular EA and mm struct.
This code is generically useful for other co-processors. This moves the code of
the cell platform so it can be used by other powerpc code. It also adds 1TB
segment handling which Cell didn't support. The new function is called
copro_calculate_slb().
This also moves the internal struct spu_slb to a generic struct copro_slb which
is now used in the Cell and copro code. We use this new struct instead of
passing around esid and vsid parameters.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently spu_handle_mm_fault() is in the cell platform.
This code is generically useful for other non-cell co-processors on powerpc.
This patch moves this function out of the cell platform into arch/powerpc/mm so
that others may use it.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Freescale updates from Scott (27 commits):
"Highlights include DMA32 zone support (SATA, USB, etc now works on 64-bit
FSL kernels), MSI changes, 8xx optimizations and cleanup, t104x board
support, and PrPMC PCI enumeration."
There is no need for yet another copy of the command line, just
use boot_command_line like everyone else.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fill in the si_addr_lsb siginfo field so the hwpoison code can
pass to userspace the length of memory that has been corrupted.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
do_page_fault was missing knowledge of HWPOISON, and we would oops
if userspace tried to access a poisoned page:
kernel BUG at arch/powerpc/mm/fault.c:180!
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Exit out early for a kernel fault, avoiding indenting of
most of the function.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We can unindent the bulk of htab_dt_scan_page_sizes() by returning early
if the property is not found. That is nice in and of itself, but also
has the advantage of making it clear that we always return success once
we have found the ibm,segment-page-sizes property.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
this patches changes some error handling logics in numa_setup_cpu(),
when cpu node is not found, so:
if the cpu is possible, but not present, -1 is kept in numa_cpu_lookup_table,
so later, if the cpu is added, we could set correct numa information for it.
if the cpu is present, then we set the first online node to
numa_cpu_lookup_table instead of 0 ( in case 0 might not be an online node? )
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
As Nish suggested, it makes more sense to init the numa node informatiion
for present cpus at boottime, which could also avoid WARN_ON(1) in
numa_setup_cpu().
With this change, we also need to change the smp_prepare_cpus() to set up
numa information only on present cpus.
For those possible, but not present cpus, their numa information
will be set up after they are started, as the original code did before commit
2fabf084b6.
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With commit 2fabf084b6 ("powerpc: reorder per-cpu NUMA information's
initialization"), during boottime, cpu_numa_callback() is called
earlier(before their online) for each cpu, and verify_cpu_node_mapping()
uses cpu_to_node() to check whether siblings are in the same node.
It skips the checking for siblings that are not online yet. So the only
check done here is for the bootcpu, which is online at that time. But
the per-cpu numa_node cpu_to_node() uses hasn't been set up yet (which
will be set up in smp_prepare_cpus()).
So I saw something like following reported:
[ 0.000000] CPU thread siblings 1/2/3 and 0 don't belong to the same
node!
As we don't actually do the checking during this early stage, so maybe
we could directly call numa_setup_cpu() in do_init_bootmem().
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A recent patch added a function prototype for htab_remove_mapping in
c code. Fix it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix a number of places where global functions were not including
their prototype. This ensures the prototype and the function match.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 1c98025c6c "powerpc: Dynamic DMA
zone limits" updated how zones are created in paging_init(), but missed
the NUMA version of paging_init(). This was noticed via a linker
error, since dma_pfn_limit_to_zone() was, like the non-NUMA
paging_init(), limited by #ifndef CONFIG_NEED_MULTIPLE_NODES.
It turns out that the NUMA paging_init() was not actually doing
anything different from the standard paging_init(), other than a couple
debug prints, a couple 32-bit-only ifdef sections, and a call to
mark_nonram_nosave(). It's not clear whether mark_nonram_nosave() is
inherently wrong to do for NUMA, or just not useful on targets that
have NUMA, but for now I'm preserving the existing behavior.
Fixes: 1c98025c6c "powerpc: Dynamic DMA zone limits"
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Scott Wood <scottwood@freescale.com>
Tasks get their end of stack set to STACK_END_MAGIC with the
aim to catch stack overruns. Currently this feature does not
apply to init_task. This patch removes this restriction.
Note that a similar patch was posted by Prarit Bhargava
some time ago but was never merged:
http://marc.info/?l=linux-kernel&m=127144305403241&w=2
Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: aneesh.kumar@linux.vnet.ibm.com
Cc: dzickus@redhat.com
Cc: bmr@redhat.com
Cc: jcastillo@redhat.com
Cc: jgh@redhat.com
Cc: minchan@kernel.org
Cc: tglx@linutronix.de
Cc: hannes@cmpxchg.org
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1410527779-8133-2-git-send-email-atomlin@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Platform code can call limit_zone_pfn() to set appropriate limits
for ZONE_DMA and ZONE_DMA32, and dma_direct_alloc_coherent() will
select a suitable zone based on a device's mask and the pfn limits that
platform code has configured.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Shaohui Xie <Shaohui.Xie@freescale.com>
Add tracepoint to track hugepage invalidate. This help us
in debugging difficult to track bugs.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We would get wrong results in compiler recomputed old_pmd. Avoid
that by using ACCESS_ONCE
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As per ISA, for 4k base page size we compare 14..65 bits of VA specified
with the entry_VA in tlb. That implies we need to make sure we do a
tlbie with all the possible 4k va we used to access the 16MB hugepage.
With 64k base page size we compare 14..57 bits of VA. Hence we cannot
ignore the lower 24 bits of va while tlbie .We also cannot tlb
invalidate a 16MB entry with just one tlbie instruction because
we don't track which va was used to instantiate the tlb entry.
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we changed base page size of the segment, either via sub_page_protect
or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
table entries. We do a lazy hash page table flush for all mapped pages
in the demoted segment. This happens when we handle hash page fault for
these pages.
We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
that implies that we could possibly have older 64K hash pte entries in
the hash page table and we need to invalidate those entries.
Use _PAGE_COMBO to determine the page size with which we should
invalidate the hash table entries on unmap.
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we changed base page size of the segment, either via sub_page_protect
or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
table entries. We do a lazy hash page table flush for all mapped pages
in the demoted segment. This happens when we handle hash page fault
for these pages.
We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
that implies that we could possibly have older 64K hash pte entries in
the hash page table and we need to invalidate those entries.
Handle this correctly for 16M pages
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The segment identifier and segment size will remain the same in
the loop, So we can compute it outside. We also change the
hugepage_invalidate interface so that we can use it the later patch
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With hugepages, we store the hpte valid information in the pte page
whose address is stored in the second half of the PMD. Use a
write barrier to make sure clearing pmd busy bit and updating
hpte valid info are ordered properly.
CC: <stable@vger.kernel.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().
Commit 6ee0578b4d ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.
Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:
bce903809a ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad469 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")
Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).
Fix this by initializing the powerpc-specific CPU<->node/local memory
node mapping as early as possible, which on powerpc is
do_init_bootmem(). Currently that function initializes the mapping for
the boot CPU, but we extend it to setup the mapping for all possible
CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
per-cpu values for all possible CPUs. That ensures that before the
early_initcalls run (and really as early as possible), the per-cpu NUMA
mapping is accurate.
While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology of:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node 0 1
0: 10 40
1: 40 10
the slab consumption decreases from
Slab: 932416 kB
SUnreclaim: 902336 kB
to
Slab: 395264 kB
SUnreclaim: 359424 kB
And we a corresponding increase in the slab efficiency from
slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 337 MB 11.28% 100.00%
task_struct 288 MB 9.93% 100.00%
to
slab mem objs slabs
used active active
------------------------------------------------------------
kmalloc-16384 37 MB 100.00% 100.00%
task_struct 31 MB 100.00% 100.00%
Powerpc didn't support memoryless nodes until recently (64bb80d87f
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c27226119
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
__early_init_mmu() does some things that are really only needed by the
boot cpu. On FSL booke, This includes calling
memblock_enforce_memory_limit(), which is labelled __init. Secondary
cpu init code can't be __init as that would break CPU hotplug.
While it's probably a bug that memblock_enforce_memory_limit() isn't
__init_memblock instead, there's no reason why we should be doing this
stuff for secondary cpus in the first place.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rather than have architectures #define ARCH_HAS_SG_CHAIN in an
architecture specific scatterlist.h, make it a proper Kconfig option and
use that instead. At same time, remove the header files are are now
mostly useless and just include asm-generic/scatterlist.h.
[sfr@canb.auug.org.au: powerpc files now need asm/dma.h]
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de> [x86]
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [powerpc]
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Ben Herrenschmidt:
"This is the powerpc new goodies for 3.17. The short story:
The biggest bit is Michael removing all of pre-POWER4 processor
support from the 64-bit kernel. POWER3 and rs64. This gets rid of a
ton of old cruft that has been bitrotting in a long while. It was
broken for quite a few versions already and nobody noticed. Nobody
uses those machines anymore. While at it, he cleaned up a bunch of
old dusty cabinets, getting rid of a skeletton or two.
Then, we have some base VFIO support for KVM, which allows assigning
of PCI devices to KVM guests, support for large 64-bit BARs on
"powernv" platforms, support for HMI (Hardware Management Interrupts)
on those same platforms, some sparse-vmemmap improvements (for memory
hotplug),
There is the usual batch of Freescale embedded updates (summary in the
merge commit) and fixes here or there, I think that's it for the
highlights"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (102 commits)
powerpc/eeh: Export eeh_iommu_group_to_pe()
powerpc/eeh: Add missing #ifdef CONFIG_IOMMU_API
powerpc: Reduce scariness of interrupt frames in stack traces
powerpc: start loop at section start of start in vmemmap_populated()
powerpc: implement vmemmap_free()
powerpc: implement vmemmap_remove_mapping() for BOOK3S
powerpc: implement vmemmap_list_free()
powerpc: Fail remap_4k_pfn() if PFN doesn't fit inside PTE
powerpc/book3s: Fix endianess issue for HMI handling on napping cpus.
powerpc/book3s: handle HMIs for cpus in nap mode.
powerpc/powernv: Invoke opal call to handle hmi.
powerpc/book3s: Add basic infrastructure to handle HMI in Linux.
powerpc/iommu: Fix comments with it_page_shift
powerpc/powernv: Handle compound PE in config accessors
powerpc/powernv: Handle compound PE for EEH
powerpc/powernv: Handle compound PE
powerpc/powernv: Split ioda_eeh_get_state()
powerpc/powernv: Allow to freeze PE
powerpc/powernv: Enable M64 aperatus for PHB3
powerpc/eeh: Aux PE data for error log
...
This patch introduces zone_for_memory() to arch_add_memory() on powerpc
to ensure new, higher memory added into ZONE_MOVABLE if movable zone has
already setup.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "Mel Gorman" <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmemmap_populated() checks whether the [start, start + page_size) has valid
pfn numbers, to know whether a vmemmap mapping has been created that includes
this range.
Some range before end might not be checked by this loop:
sec11start......start11..sec11end/sec12start..end....start12..sec12end
as the above, for start11(section 11), it checks [sec11start, sec11end), and
loop ends as the next start(start12) is bigger than end. However,
[sec11end/sec12start, end) is not checked here.
So before the loop, adjust the start to be the start of the section, so we don't miss ranges like the above.
After we adjust start to be the start of the section, it also means it's
aligned with vmemmap as of the sizeof struct page, so we could use
page_to_pfn directly in the loop.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
vmemmap_free() does the opposite of vmemap_populate().
This patch also puts vmemmap_free() and vmemmap_list_free() into
CONFIG_MEMMORY_HOTPLUG.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is to be called in vmemmap_free(), leave the implementation on BOOK3E
empty as before.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch implements vmemmap_list_free() for vmemmap_free().
The freed entries will be removed from vmemmap_list, and form a freed list,
with next as the header. The next position in the last allocated page is kept
at the list tail.
When allocation, if there are freed entries left, get it from the freed list;
if no freed entries left, get it like before from the last allocated pages.
With this change, realmode_pfn_to_page() also needs to be changed to walk
all the entries in the vmemmap_list, as the virt_addr of the entries might not
be stored in order anymore.
It helps to reuse the memory when continuous doing memory hot-plug/remove
operations, but didn't reclaim the pages already allocated, so the memory usage
will only increase, but won't exceed the value for the largest memory
configuration.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This silences a section mismatch warning. early_alloc_pgtable() is
called from map_kernel_page() which cannot be __init, but only when
slab_is_available() returns false which can only happen during early
boot.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Scott writes:
Highlights include e6500 hardware threading support, an e6500 TLB erratum
workaround, corenet error reporting, support for a new board, and some
minor fixes.
Erratum A-008139 can cause duplicate TLB entries if an indirect
entry is overwritten using tlbwe while the other thread is using it to
do a lookup. Work around this by using tlbilx to invalidate prior
to overwriting.
To avoid the need to save another register to hold MAS1 during the
workaround code, TID clearing has been moved from tlb_miss_kernel_e6500
until after the SMT section.
Signed-off-by: Scott Wood <scottwood@freescale.com>
There are still a few occurences where it remains, because it helps to
explain something that persists.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Now that we have dropped power3 support we can remove CONFIG_POWER3. The
usage in pgtable_32.c was already dead code as CONFIG_POWER3 was not
selectable on PPC32.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We now only support cpus that use an SLB, so we don't need an MMU
feature to indicate that.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Old cpus didn't have a Segment Lookaside Buffer (SLB), instead they had
a Segment Table (STAB). Now that we've dropped support for those cpus,
we can remove the STAB support entirely.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In fb5a515704 "powerpc: Remove platforms/wsp and associated pieces",
we removed the last user of MMU_FTRS_A2. So remove it.
MMU_FTRS_A2 was the last user of MMU_FTR_TYPE_3E, so remove it also.
This leaves some unreachable code in mmu_context_nohash.c, so remove
that also.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Virtualized environments may expose a e6500 dual-threaded core
as two single-threaded e6500 cores. Take advantage of this
and get rid of the tlb lock and the trap-causing tlbsx in
the htw miss handler by guarding with CPU_FTR_SMT, as it's
already being done in the bolted tlb1 miss handler.
As seen in the results below, measurements done with lmbench
random memory access latency test running under Freescale's
Embedded Hypervisor, there is a ~34% improvement.
Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
----------------------------------------------------
Host Mhz L1 $ L2 $ Main mem Rand mem
--------- --- ---- ---- -------- --------
smt 1665 1.8020 13.2 83.0 1149.7
nosmt 1665 1.8020 13.2 83.0 758.1
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Cc: Scott Wood <scottwood@freescale.com>
[scottwood@freescale.com: commit message tweak]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Commit 82d86de25b "TLB lock recursive"
introduced a bug whereby cpu 0 uses the same value for "lock held" as
is used to indicate that the lock is free. This means that cpu 1 can
acquire the lock whenever it wants, regardless of whether cpu 0 has it
locked, which in turn means we can get duplicate TLB entries.
Add one to the CPU value to ensure we do not use zero as a "lock held"
value.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Reported-by: Ed Swarthout <ed.swarthout@freescale.com>
Previously TID was being cleared before the tlbsx, but not after. This
can lead to a multiway hit between a TLB entry with TID=0 (previously
inserted when PID=0) and a TLB entry with TID!=0 that matches PID.
This can theoretically result in undefined behavior, though we probably
get lucky due to the details of the overlap. It also results in the
inability to use multihit detection to detect other conflicting TLB
entries, as well as poorer TLB utilization due to duplicating kernel
TLB entries.
Rather than try to patch up MAS1 after tlbsx, the entire value is
saved/restored as with MAS2.
I observed a slight improvement in TLB miss performance with this patch
applied.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Reported-by: Ed Swarthout <ed.swarthout@freescale.com>
Pull powerpc updates from Ben Herrenschmidt:
"Here is the bulk of the powerpc changes for this merge window. It got
a bit delayed in part because I wasn't paying attention, and in part
because I discovered I had a core PCI change without a PCI maintainer
ack in it. Bjorn eventually agreed it was ok to merge it though we'll
probably improve it later and I didn't want to rebase to add his ack.
There is going to be a bit more next week, essentially fixes that I
still want to sort through and test.
The biggest item this time is the support to build the ppc64 LE kernel
with our new v2 ABI. We previously supported v2 userspace but the
kernel itself was a tougher nut to crack. This is now sorted mostly
thanks to Anton and Rusty.
We also have a fairly big series from Cedric that add support for
64-bit LE zImage boot wrapper. This was made harder by the fact that
traditionally our zImage wrapper was always 32-bit, but our new LE
toolchains don't really support 32-bit anymore (it's somewhat there
but not really "supported") so we didn't want to rely on it. This
meant more churn that just endian fixes.
This brings some more LE bits as well, such as the ability to run in
LE mode without a hypervisor (ie. under OPAL firmware) by doing the
right OPAL call to reinitialize the CPU to take HV interrupts in the
right mode and the usual pile of endian fixes.
There's another series from Gavin adding EEH improvements (one day we
*will* have a release with less than 20 EEH patches, I promise!).
Another highlight is the support for the "Split core" functionality on
P8 by Michael. This allows a P8 core to be split into "sub cores" of
4 threads which allows the subcores to run different guests under KVM
(the HW still doesn't support a partition per thread).
And then the usual misc bits and fixes ..."
[ Further delayed by gmail deciding that BenH is a dirty spammer.
Google knows. ]
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (155 commits)
powerpc/powernv: Add missing include to LPC code
selftests/powerpc: Test the THP bug we fixed in the previous commit
powerpc/mm: Check paca psize is up to date for huge mappings
powerpc/powernv: Pass buffer size to OPAL validate flash call
powerpc/pseries: hcall functions are exported to modules, need _GLOBAL_TOC()
powerpc: Exported functions __clear_user and copy_page use r2 so need _GLOBAL_TOC()
powerpc/powernv: Set memory_block_size_bytes to 256MB
powerpc: Allow ppc_md platform hook to override memory_block_size_bytes
powerpc/powernv: Fix endian issues in memory error handling code
powerpc/eeh: Skip eeh sysfs when eeh is disabled
powerpc: 64bit sendfile is capped at 2GB
powerpc/powernv: Provide debugfs access to the LPC bus via OPAL
powerpc/serial: Use saner flags when creating legacy ports
powerpc: Add cpu family documentation
powerpc/xmon: Fix up xmon format strings
powerpc/powernv: Add calls to support little endian host
powerpc: Document sysfs DSCR interface
powerpc: Fix regression of per-CPU DSCR setting
powerpc: Split __SYSFS_SPRSETUP macro
arch: powerpc/fadump: Cleaning up inconsistent NULL checks
...
We have a bug in our hugepage handling which exhibits as an infinite
loop of hash faults. If the fault is being taken in the kernel it will
typically trigger the softlockup detector, or the RCU stall detector.
The bug is as follows:
1. mmap(0xa0000000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
2. Slice code converts the slice psize to 16M.
3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
synchronises the mm->context with the paca->context. So the paca slice
mask is updated to include the 16M slice.
3. Either:
* mmap() fails because there are no huge pages available.
* mmap() succeeds and the mapping is then munmapped.
In both cases the slice psize remains at 16M in both the paca & mm.
4. mmap(0xa0000000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
5. The slice psize is converted back to 64K. Because of the check on line 539
of slice.c we DO NOT update the paca->context. The paca slice mask is now
out of sync with the mm slice mask.
6. User/kernel accesses 0xa0000000.
7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
to create an SLB entry and inserts it in the SLB.
18. With the 16M SLB entry in place the hardware does a hash lookup, no entry
is found so a data access exception is generated.
19. The data access handler calls do_page_fault() -> handle_mm_fault().
10. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page().
11. The hardware retries the access, there is still nothing in the hash table
so once again a data access exception is generated.
12. hash_page() calls into __hash_page_thp() and inserts a mapping in the
hash. Although the THP mapping maps 16M the hashing is done using 64K
as the segment page size.
13. hash_page() returns immediately after calling __hash_page_thp(), skipping
over the code at line 1125. Resulting in the mismatch between the
paca->context and mm->context not being detected.
14. The hardware retries the access, the hash it generates using the 16M
SLB entry does NOT match the hash we inserted.
15. We take another data access and go into __hash_page_thp().
16. We see a valid entry in the hpte_slot_array and so we call updatepp()
which succeeds.
17. Goto 14.
We could fix this in two ways. The first would be to remove or modify
the check on line 539 of slice.c.
The second option is to cause the check of paca psize in hash_page() on
line 1125 to also be done for THP pages.
We prefer the latter, because the check & update of the paca psize is
not done until we know it's necessary. It's also done only on the
current cpu, so we don't need to IPI all other cpus.
Without further rearranging the code, the simplest fix is to pull out
the code that checks paca psize and call it in two places. Firstly for
THP/hugetlb, and secondly for other mappings as before.
Thanks to Dave Jones for trinity, which originally found this bug.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: stable@vger.kernel.org [v3.11+]
Currently hugepage migration is available for all archs which support
pmd-level hugepage, but testing is done only for x86_64 and there're
bugs for other archs. So to avoid breaking such archs, this patch
limits the availability strictly to x86_64 until developers of other
archs get interested in enabling this feature.
Simply disabling hugepage migration on non-x86_64 archs is not enough to
fix the reported problem where sys_move_pages() hits the BUG_ON() in
follow_page(FOLL_GET), so let's fix this by checking if hugepage
migration is supported in vma_migratable().
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Another round of clean-up of FDT related code in architecture code.
This removes knowledge of internal FDT details from most architectures
except powerpc.
- Conversion of kernel's custom FDT parsing code to use libfdt.
- DT based initialization for generic serial earlycon. The introduction
of generic serial earlycon support went in thru tty tree.
- Improve the platform device naming for DT probed devices to ensure
unique naming and use parent names instead of a global index.
- Fix a race condition in of_update_property.
- Unify the various linker section OF match tables and fix several
function prototype errors.
- Update platform_get_irq_byname to work in deferred probe cases.
- 2 binding doc updates
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJTjzgyAAoJEMhvYp4jgsXiFsUH/1PMTGo8CyD62VQD5ZKdAoW+
Fq6vCiRQ8assF5i5ZLcW1DqhjtoRaCKYhVbRKa5lj7cZdjlSpacI/qQPrF5Br2Ii
bTE3Ff/AQwipQaz/Bj7HqJCgGwfWK8xdfgW0abKsyXMWDN86Bov/zzeu8apmws0x
H1XjJRgnc/rzM4m9ny6+lss0iq6YL54SuTYNzHR33+Ywxls69SfHXIhCW0KpZcBl
5U3YUOomt40GfO46sxFA4xApAhypEK4oVq7asyiA2ArTZ/c2Pkc9p5CBqzhDLmlq
yioWTwHIISv0q+yMLCuQrVGIsbUDkQyy7RQ15z6U+/e/iGO/M+j3A5yxMc3qOi4=
=Onff
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux into next
Pull DeviceTree updates from Rob Herring:
- Another round of clean-up of FDT related code in architecture code.
This removes knowledge of internal FDT details from most
architectures except powerpc.
- Conversion of kernel's custom FDT parsing code to use libfdt.
- DT based initialization for generic serial earlycon. The
introduction of generic serial earlycon support went in through the
tty tree.
- Improve the platform device naming for DT probed devices to ensure
unique naming and use parent names instead of a global index.
- Fix a race condition in of_update_property.
- Unify the various linker section OF match tables and fix several
function prototype errors.
- Update platform_get_irq_byname to work in deferred probe cases.
- 2 binding doc updates
* tag 'devicetree-for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (58 commits)
of: handle NULL node in next_child iterators
of/irq: provide more wrappers for !CONFIG_OF
devicetree: bindings: Document micrel vendor prefix
dt: bindings: dwc2: fix required value for the phy-names property
of_pci_irq: kill useless variable in of_irq_parse_pci()
of/irq: do irq resolution in platform_get_irq_byname()
of: Add a testcase for of_find_node_by_path()
of: Make of_find_node_by_path() handle /aliases
of: Create unlocked version of for_each_child_of_node()
lib: add glibc style strchrnul() variant
of: Handle memory@0 node on PPC32 only
pci/of: Remove dead code
of: fix race between search and remove in of_update_property()
of: Use NULL for pointers
of: Stop naming platform_device using dcr address
of: Ensure unique names without sacrificing determinism
tty/serial: pl011: add DT based earlycon support
of/fdt: add FDT serial scanning for earlycon
of/fdt: add FDT address translation support
serial: earlycon: add DT support
...
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration,
GDB support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace interface
and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still,
we have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
bAknaZpPiOrW11Et1htY
=j5Od
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
Pull KVM updates from Paolo Bonzini:
"At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM. Changes include:
- a lot of s390 changes: optimizations, support for migration, GDB
support and more
- ARM changes are pretty small: support for the PSCI 0.2 hypercall
interface on both the guest and the host (the latter acked by
Catalin)
- initial POWER8 and little-endian host support
- support for running u-boot on embedded POWER targets
- pretty large changes to MIPS too, completing the userspace
interface and improving the handling of virtualized timer hardware
- for x86, a larger set of changes is scheduled for 3.17. Still, we
have a few emulator bugfixes and support for running nested
fully-virtualized Xen guests (para-virtualized Xen guests have
always worked). And some optimizations too.
The only missing architecture here is ia64. It's not a coincidence
that support for KVM on ia64 is scheduled for removal in 3.17"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
KVM: add missing cleanup_srcu_struct
KVM: PPC: Book3S PR: Rework SLB switching code
KVM: PPC: Book3S PR: Use SLB entry 0
KVM: PPC: Book3S HV: Fix machine check delivery to guest
KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
KVM: PPC: Book3S HV: Fix dirty map for hugepages
KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
KVM: PPC: Book3S: Add ONE_REG register names that were missed
KVM: PPC: Add CAP to indicate hcall fixes
KVM: PPC: MPIC: Reset IRQ source private members
KVM: PPC: Graciously fail broken LE hypercalls
PPC: ePAPR: Fix hypercall on LE guest
KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
KVM: PPC: BOOK3S: Always use the saved DAR value
PPC: KVM: Make NX bit available with magic page
KVM: PPC: Disable NX for old magic page using guests
KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
...
On LPAR guest systems Linux enables the shadow SLB to indicate to the
hypervisor a number of SLB entries that always have to be available.
Today we go through this shadow SLB and disable all ESID's valid bits.
However, pHyp doesn't like this approach very much and honors us with
fancy machine checks.
Fortunately the shadow SLB descriptor also has an entry that indicates
the number of valid entries following. During the lifetime of a guest
we can just swap that value to 0 and don't have to worry about the
SLB restoration magic.
While we're touching the code, let's also make it more readable (get
rid of rldicl), allow it to deal with a dynamic number of bolted
SLB entries and only do shadow SLB swizzling on LPAR systems.
Signed-off-by: Alexander Graf <agraf@suse.de>
The only way Freescale booke chips support mappings larger than 4K
is via TLB1. The only way we support (direct) TLB1 entries is via
hugetlb, which is not what map_kernel_page() does when given a large
page size.
Without this, a kernel with CONFIG_SPARSEMEM_VMEMMAP enabled crashes on
boot with messages such as:
PID hash table entries: 4096 (order: 3, 32768 bytes)
Sorting __ex_table...
BUG: Bad page state in process swapper pfn:00a2f
page:8000040000023a48 count:0 mapcount:0 mapping:0000040000ffce48 index:0x40000ffbe50
page flags: 0x40000ffda40(active|arch_1|private|private_2|head|tail|swapcache|mappedtodisk|reclaim|swapbacked|unevictable|mlocked)
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
bad because of flags:
page flags: 0x311840(active|private|private_2|swapcache|unevictable|mlocked)
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.15.0-rc1-00003-g7fa250c #299
Call Trace:
[c00000000098ba20] [c000000000008b3c] .show_stack+0x7c/0x1cc (unreliable)
[c00000000098baf0] [c00000000060aa50] .dump_stack+0x88/0xb4
[c00000000098bb70] [c0000000000c0468] .bad_page+0x144/0x1a0
[c00000000098bc10] [c0000000000c0628] .free_pages_prepare+0x164/0x17c
[c00000000098bcc0] [c0000000000c24cc] .free_hot_cold_page+0x48/0x214
[c00000000098bd60] [c00000000086c318] .free_all_bootmem+0x1fc/0x354
[c00000000098be70] [c00000000085da84] .mem_init+0xac/0xdc
[c00000000098bef0] [c0000000008547b0] .start_kernel+0x21c/0x4d4
[c00000000098bf90] [c000000000000448] .start_here_common+0x20/0x58
Signed-off-by: Scott Wood <scottwood@freescale.com>
This request includes a few bug fixes that really shouldn't wait for the next
release.
It fixes KVM on 32bit PowerPC when built as module. It also fixes the PV KVM
acceleration when NX gets honored by the host. Furthermore we fix transactional
memory support and numa support on HV KVM.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJTcKFaAAoJECszeR4D/txg7qYP/RX3V32i2zQYH2NpjQrDCwtY
Wur+CQrn/VA6xhtTK1rT2zH5rNFLt6ClhtxCMkZFfBdUE4sHi3OTlEdcvXBZjbls
JqQ/7lBkUPN8pTpz2NHP9gvH7g6v07EruysRQNa/JZMzlwhpzWk8D7yXakaCPNY/
JZRgVTrfKnhQ8OtXt48Bp4EmEKllbNqi9kNN7dewD2dEb3fAco3Jpk6WoeG+1f0o
jv3NmeTsp87KaRpjvDzPb7iCe6PA7GVqvJIQpir3Rpk2Kpx0yj58AfacF+f72GOf
CPlJGepiumJCaANhV6dbvtS49vaiiAnSvbqCil2USNl0LIGWQXdSjs5lztEuiMyr
tAav0YSVpnIcw0HJxXug/M31VwfRjYCX3hnCCIOd3Xj2jgAqwD+Lo95uUrRGJ9TP
75zKh8E093tOXIC9CyMaiYajpFMUrCSMgnpJ+7fpeHiyigB6yc8juFxahIHsw8q1
NgHggroJm6QNIm8JSY/tG/YET4AT7H4ZetGP8MeeRUg0TpqQXvYpkMGB8YDouaBA
XzxjwyTq57BOYgLGExnwW3Jj0kbqVY+ts0aDGQVGrl5YFzooGqrQ61CRmwG5BvI8
sou3l6TJ2ng8qrc7Maw9MHca1QB3mtXD7I26T/QEfQm9NLRTTqJyaxH5J1q9siRI
PpHVE5FKnmWPNr8JlxtC
=t2S+
-----END PGP SIGNATURE-----
Merge tag 'signed-for-3.15' of git://github.com/agraf/linux-2.6 into kvm-master
Patch queue for 3.15 - 2014-05-12
This request includes a few bug fixes that really shouldn't wait for the next
release.
It fixes KVM on 32bit PowerPC when built as module. It also fixes the PV KVM
acceleration when NX gets honored by the host. Furthermore we fix transactional
memory support and numa support on HV KVM.
This series adds support for building the powerpc 64-bit
LE kernel using the new ABI v2. We already supported
running ABI v2 userspace programs but this adds support
for building the kernel itself using the new ABI.
In case of extending the eaddr in future, use this macro
PGTABLE_EADDR_SIZE to ease the maintenance of the code.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we never get around to seeing an HEA ethernet adapter, there's
no point in restricting ourselves to 4k IO page size.
This speeds up IO maps when CONFIG_IBMEBUS is disabled.
[ Updated the test to also lift the restriction on arch 2.07
(Power 8) which cannot have an HEA
-- BenH ]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
foo
Make of_get_flat_dt_prop arguments compatible with libfdt fdt_getprop
call in preparation to convert FDT code to use libfdt. Make the return
value const and the property length ptr type an int.
Signed-off-by: Rob Herring <robh@kernel.org>
Tested-by: Michal Simek <michal.simek@xilinx.com>
Tested-by: Grant Likely <grant.likely@linaro.org>
Tested-by: Stephen Chivers <schivers@csc.com>
Our PV guest patching code assembles chunks of instructions on the fly when it
encounters more complicated instructions to hijack. These instructions need
to live in a section that we don't mark as non-executable, as otherwise we
fault when jumping there.
Right now we put it into the .bss section where it automatically gets marked
as non-executable. Add a check to the NX setting function to ensure that we
leave these particular pages executable.
Signed-off-by: Alexander Graf <agraf@suse.de>
The if condition check was based on a draft ISA doc. Remove the same.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The MMU hashtable and SLB branch patching code uses function
pointers for the update sites. This creates a difference between
ABIv1 and ABIv2 because we don't have function descriptors on
ABIv2.
Get rid of the function pointer and just point at the update
sites directly. This works on both ABIs.
Signed-off-by: Anton Blanchard <anton@samba.org>
binutils is smart enough to know that a branch to a function
descriptor is actually a branch to the functions text address.
Alan tells me that binutils has been doing this for 9 years.
Signed-off-by: Anton Blanchard <anton@samba.org>
Since v1:
Edited the comment according to Srivatsa's suggestion.
During the testing, we encounter below WARN followed by Oops:
WARNING: at kernel/sched/core.c:6218
...
NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
LR [c000000000101358] .build_sched_domains+0xec8/0x1200
PACATMSCRATCH [800000000000f032]
Call Trace:
[c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
...
Oops: Kernel access of bad area, sig: 11 [#1]
...
NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
PACATMSCRATCH [8000000000029032]
Call Trace:
[c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
[c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
...
This was caused by that 'sd->groups == NULL' after building groups, which
was caused by the empty 'sd->span'.
The cpu's domain contained nothing because the cpu was assigned to a wrong
node, due to the following unfortunate sequence of events:
1. The hypervisor sent a topology update to the guest OS, to notify changes
to the cpu-node mapping. However, the update was actually redundant - i.e.,
the "new" mapping was exactly the same as the old one.
2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
the 'for-loop' in arch_update_cpu_topology().
3. So we ended up calling stop-machine() with an empty cpumask list, which made
stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
the cpu to run the payload (the update_cpu_topology() function).
4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
finds update->cpu as well as update->new_nid to be 0. In other words, we
end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.
Along with the following wrong updating, it causes the sched-domain rebuild
code to break and crash the system.
Fix this by skipping the topology update in cases where we find that
the topology has not actually changed in reality (ie., spurious updates).
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Nathan Fontenot <nfont@linux.vnet.ibm.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Robert Jennings <rcj@linux.vnet.ibm.com>
CC: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
CC: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
CC: Alistair Popple <alistair@popple.id.au>
Suggested-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to handle numa pte via the slow path
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Freescale updates from Scott. Mostly support for critical
and machine check exceptions on 64-bit BookE, some new
PCI suspend/resume work and misc bits.
We have generic code like the one in get_futex_key that assume that
a local_irq_disable prevents a parallel THP split. Support that by
adding a dummy smp call function after setting _PAGE_SPLITTING. Code
paths like get_user_pages_fast still need to check for _PAGE_SPLITTING
after disabling IRQ which indicate that a parallel THP splitting is
ongoing. Now if they don't find _PAGE_SPLITTING set, then we can be
sure that parallel split will now block in pmdp_splitting flush
until we enables IRQ
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add special state saving for critical and machine check exceptions.
Most of this code could be used to handle debug exceptions taken from
kernel space, but actually doing so is outside the scope of this patch.
The various critical and machine check exceptions now point to their
real handlers, rather than hanging the kernel.
Signed-off-by: Scott Wood <scottwood@freescale.com>
While bolted handlers (including e6500) do not need to deal with a TLB
miss recursively causing another TLB miss, nested TLB misses can still
happen with crit/mc/debug exceptions -- so we still need to honor
SPRG_TLB_EXFRAME.
We don't need to spend time modifying it in the TLB miss fastpath,
though -- the special level exception will handle that.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
Cc: kvm-ppc@vger.kernel.org
Once special level interrupts are supported, we may take nested TLB
misses -- so allow the same thread to acquire the lock recursively.
The lock will not be effective against the nested TLB miss handler
trying to write the same entry as the interrupted TLB miss handler, but
that's also a problem on non-threaded CPUs that lack TLB write
conditional. This will be addressed in the patch that enables crit/mc
support by invalidating the TLB on return from level exceptions.
Signed-off-by: Scott Wood <scottwood@freescale.com>
The previous code added wrong TLBs and causes machine check errors if
a driver accessed passed the end of the linear mapping instead of
a clean page fault.
Signed-off-by: Ralph E. Bellofatto <ralphbel@us.ibm.com>
Signed-off-by: Benjamin Krill <ben@codiert.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The memory remove code for powerpc/pseries should call remove_memory()
so that we are holding the hotplug_memory lock during memory remove
operations.
This patch updates the memory node remove handler to call remove_memory()
and adds a ppc_md.remove_memory() entry to handle pseries specific work
that is called from arch_remove_memory().
During memory remove in pseries_remove_memblock() we have to stay with
removing memory one section at a time. This is needed because of how memory
resources are handled. During memory add for pseries (via the probe file in
sysfs) we add memory one section at a time which gives us a memory resource
for each section. Future patches will aim to address this so will not have
to remove memory one section at a time.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
pte_update() is a powerpc-ism used to change the bits of a PTE
when the access permission is being restricted (a flush is
potentially needed).
It uses atomic operations on when needed and handles the hash
synchronization on hash based processors.
It is currently only used to clear PTE bits and so the current
implementation doesn't provide a way to also set PTE bits.
The new _PAGE_NUMA bit, when set, is actually restricting access
so it must use that function too, so this change adds the ability
for pte_update() to also set bits.
We will use this later to set the _PAGE_NUMA bit.
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On p8 systems, with relocation on exception feature enabled we are seeing
kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this
feature enabled, exception are raised with MMU (IR=DR=1) ON with the
default offset of 0xc*4000. Since exception is raised in virtual mode it
requires the vector region to be executable without which it fails to
fetch and execute instruction at 0xc*4xxx. For default kernel since kernel
is loaded at real 0, the htab mappings sets the entire kernel text region
executable. But for relocatable kernel (e.g. kdump case) we only copy
interrupt vectors down to real 0 and never marked that region as
executable because in p7 and below we always get exception in real mode.
This patch fixes this issue by marking htab mapping range as executable
that overlaps with the interrupt vector region for relocatable kernel.
Thanks to Ben who helped me to debug this issue and find the root cause.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
According to Posix, if MAP_FIXED is specified mmap shall set ENOMEM if
the requested mapping exceeds the allowed range for address space of
the process. The generic code set it right, but the specific powerpc
slice_get_unmapped_area() function currently returns -EINVAL in that
case.
This patch corrects it.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
<<
This contains a fix for a chroma_defconfig build break that was
introduced by e6500 tablewalk support, and a device tree binding patch
that missed the previous pull request due to some last-minute polishing.
>>
Pull powerpc updates from Ben Herrenschmidt:
"So here's my next branch for powerpc. A bit late as I was on vacation
last week. It's mostly the same stuff that was in next already, I
just added two patches today which are the wiring up of lockref for
powerpc, which for some reason fell through the cracks last time and
is trivial.
The highlights are, in addition to a bunch of bug fixes:
- Reworked Machine Check handling on kernels running without a
hypervisor (or acting as a hypervisor). Provides hooks to handle
some errors in real mode such as TLB errors, handle SLB errors,
etc...
- Support for retrieving memory error information from the service
processor on IBM servers running without a hypervisor and routing
them to the memory poison infrastructure.
- _PAGE_NUMA support on server processors
- 32-bit BookE relocatable kernel support
- FSL e6500 hardware tablewalk support
- A bunch of new/revived board support
- FSL e6500 deeper idle states and altivec powerdown support
You'll notice a generic mm change here, it has been acked by the
relevant authorities and is a pre-req for our _PAGE_NUMA support"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
powerpc: Add support for the optimised lockref implementation
powerpc/powernv: Call OPAL sync before kexec'ing
powerpc/eeh: Escalate error on non-existing PE
powerpc/eeh: Handle multiple EEH errors
powerpc: Fix transactional FP/VMX/VSX unavailable handlers
powerpc: Don't corrupt transactional state when using FP/VMX in kernel
powerpc: Reclaim two unused thread_info flag bits
powerpc: Fix races with irq_work
Move precessing of MCE queued event out from syscall exit path.
pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
powerpc: Make add_system_ram_resources() __init
powerpc: add SATA_MV to ppc64_defconfig
powerpc/powernv: Increase candidate fw image size
powerpc: Add debug checks to catch invalid cpu-to-node mappings
powerpc: Fix the setup of CPU-to-Node mappings during CPU online
powerpc/iommu: Don't detach device without IOMMU group
powerpc/eeh: Hotplug improvement
powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
powerpc/eeh: Add restore_config operation
...
...and make CONFIG_PPC_FSL_BOOK3E conflict with CONFIG_PPC_64K_PAGES.
This fixes a build break with CONFIG_PPC_64K_PAGES on 64-bit book3e,
that was introduced by commit 28efc35fe6
("powerpc/e6500: TLB miss handler with hardware tablewalk support").
Signed-off-by: Scott Wood <scottwood@freescale.com>
add_system_ram_resources() is a subsys_initcall.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There have been some weird bugs in the past where the kernel tried to associate
threads of the same core to different NUMA nodes, and things went haywire after
that point (as expected).
But unfortunately, root-causing such issues have been quite challenging, due to
the lack of appropriate debug checks in the kernel. These bugs usually lead to
some odd soft-lockups in the scheduler's build-sched-domain code in the CPU
hotplug path, which makes it very hard to trace it back to the incorrect
cpu-to-node mappings.
So add appropriate debug checks to catch such invalid cpu-to-node mappings
as early as possible.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On POWER platforms, the hypervisor can notify the guest kernel about dynamic
changes in the cpu-numa associativity (VPHN topology update). Hence the
cpu-to-node mappings that we got from the firmware during boot, may no longer
be valid after such updates. This is handled using the arch_update_cpu_topology()
hook in the scheduler, and the sched-domains are rebuilt according to the new
mappings.
But unfortunately, at the moment, CPU hotplug ignores these updated mappings
and instead queries the firmware for the cpu-to-numa relationships and uses
them during CPU online. So the kernel can end up assigning wrong NUMA nodes
to CPUs during subsequent CPU hotplug online operations (after booting).
Further, a particularly problematic scenario can result from this bug:
On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
threads per core. The switch to Single-Threaded (ST) mode is performed by
offlining all except the first CPU thread in each core. Switching back to
SMT mode involves onlining those other threads back, in each core.
Now consider this scenario:
1. During boot, the kernel gets the cpu-to-node mappings from the firmware
and assigns the CPUs to NUMA nodes appropriately, during CPU online.
2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
communicates this update to the kernel. The kernel in turn updates its
cpu-to-node associations and rebuilds its sched domains. Everything is
fine so far.
3. Now, the user switches the machine from SMT to ST mode (say, by running
ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
core.
4. The user then tries to switch back from ST to SMT mode (say, by running
ppc64_cpu --smt=4), and this involves onlining those threads back. Since
CPU hotplug ignores the new mappings, it queries the firmware and tries to
associate the newly onlined sibling threads to the old NUMA nodes. This
results in sibling threads within the same core getting associated with
different NUMA nodes, which is incorrect.
The scheduler's build-sched-domains code gets thoroughly confused with this
and enters an infinite loop and causes soft-lockups, as explained in detail
in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings).
So to fix this, use the numa_cpu_lookup_table to remember the updated
cpu-to-node mappings, and use them during CPU hotplug online operations.
Further, we also need to ensure that all threads in a core are assigned to a
common NUMA node, irrespective of whether all those threads were online during
the topology update. To achieve this, we take care not to use cpu_sibling_mask()
since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread()
and set up the mappings manually using the 'threads_per_core' value for that
particular platform. This helps us ensure that we don't hit this bug with any
combination of CPU hotplug and SMT mode switching.
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>. Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.
The one instance where we add an include for init.h covers off
a case where that file was implicitly getting it from another
header which itself didn't need it.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It was branching to the cleanup part of the non-bolted handler,
which would have been bad if there were any chips with tlbsrx.
that use the bolted handler.
Signed-off-by: Scott Wood <scottwood@freescale.com>
This keeps usage coordinated for hugetlb and indirect entries, which
should make entry selection more predictable and probably improve overall
performance when mixing the two.
Signed-off-by: Scott Wood <scottwood@freescale.com>
There are a few things that make the existing hw tablewalk handlers
unsuitable for e6500:
- Indirect entries go in TLB1 (though the resulting direct entries go in
TLB0).
- It has threads, but no "tlbsrx." -- so we need a spinlock and
a normal "tlbsx". Because we need this lock, hardware tablewalk
is mandatory on e6500 unless we want to add spinlock+tlbsx to
the normal bolted TLB miss handler.
- TLB1 has no HES (nor next-victim hint) so we need software round robin
(TODO: integrate this round robin data with hugetlb/KVM)
- The existing tablewalk handlers map half of a page table at a time,
because IBM hardware has a fixed 1MiB indirect page size. e6500
has variable size indirect entries, with a minimum of 2MiB.
So we can't do the half-page indirect mapping, and even if we
could it would be less efficient than mapping the full page.
- Like on e5500, the linear mapping is bolted, so we don't need the
overhead of supporting nested tlb misses.
Note that hardware tablewalk does not work in rev1 of e6500.
We do not expect to support e6500 rev1 in mainline Linux.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Mihai Caraman <mihai.caraman@freescale.com>
There is no barrier between something like ioremap() writing to
a PTE, and returning the value to a caller that may then store the
pointer in a place that is visible to other CPUs. Such callers
generally don't perform barriers of their own.
Even if callers of ioremap() and similar things did use barriers,
the most logical choise would be smp_wmb(), which is not
architecturally sufficient when BookE hardware tablewalk is used. A
full sync is specified by the architecture.
For userspace mappings, OTOH, we generally already have an lwsync due
to locking, and if we occasionally take a spurious fault due to not
having a full sync with hardware tablewalk, it will not be fatal
because we will retry rather than oops.
Signed-off-by: Scott Wood <scottwood@freescale.com>
When booting above the 64M for a secondary cpu, we also face the
same issue as the boot cpu that the PAGE_OFFSET map two different
physical address for the init tlb and the final map. So we have to use
switch_to_as1/restore_to_as0 between the conversion of these two
maps. When restoring to as0 for a secondary cpu, we only need to
return to the caller. So add a new parameter for function
restore_to_as0 for this purpose.
Use LOAD_REG_ADDR_PIC to get the address of variables which may
be used before we set the final map in cams for the secondary cpu.
Move the setting of cams a bit earlier in order to avoid the
unnecessary using of LOAD_REG_ADDR_PIC.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This is always true for a non-relocatable kernel. Otherwise the kernel
would get stuck. But for a relocatable kernel, it seems a little
complicated. When booting a relocatable kernel, we just align the
kernel start addr to 64M and map the PAGE_OFFSET from there. The
relocation will base on this virtual address. But if this address
is not the same as the memstart_addr, we will have to change the
map of PAGE_OFFSET to the real memstart_addr and do another relocation
again.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
[scottwood@freescale.com: make offset long and non-negative in simple case]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Introduce this function so we can set both the physical and virtual
address for the map in cams. This will be used by the relocation code.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
We use the tlb1 entries to map low mem to the kernel space. In the
current code, it assumes that the first tlb entry would cover the
kernel image. But this is not true for some special cases, such as
when we run a relocatable kernel above the 64M or set
CONFIG_KERNEL_START above 64M. So we choose to switch to address
space 1 before setting these tlb entries.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
This is based on the codes in the head_44x.S. The difference is that
the init tlb size we used is 64M. With this patch we can only load the
kernel at address between memstart_addr ~ memstart_addr + 64M. We will
fix this restriction in the following patches.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
The e500v1 doesn't implement the MAS7, so we should avoid to access
this register on that implementations. In the current kernel, the
access to MAS7 are protected by either CONFIG_PHYS_64BIT or
MMU_FTR_BIG_PHYS. Since some code are executed before the code
patching, we have to use CONFIG_PHYS_64BIT in these cases.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
We want to make sure we don't use these function when updating a pte
or pmd entry that have a valid hpte entry, because these functions
don't invalidate them. So limit the check to _PAGE_PRESENT bit.
Numafault core changes use these functions for updating _PAGE_NUMA bits.
That should be ok because when _PAGE_NUMA is set we can be sure that
hpte entries are not present.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Set memory coherence always on hash64 config. If
a platform cannot have memory coherence always set they
can infer that from _PAGE_NO_CACHE and _PAGE_WRITETHRU
like in lpar. So we dont' really need a separate bit
for tracking _PAGE_COHERENCE.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
So that it can be used by other codes. No function change.
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
And in flush_hugetlb_page(), don't check whether vma is NULL after
we've already dereferenced it.
This was found by Dan using static analysis as described here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2013-November/113161.html
We currently get away with this because the callers that currently pass
NULL for vma seem to be 32-bit-only (e.g. highmem, and
CONFIG_DEBUG_PGALLOC in pgtable_32.c) Hugetlb is currently 64-bit only,
so we never saw a NULL vma here.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Commit fba2369e6c (mm: use vm_unmapped_area() on powerpc architecture)
has a bug in slice_scan_available() where we compare an unsigned long
(high_slices) against a shifted int. As a result, comparisons against
the top 32 bits of high_slices (representing the top 32TB) always
returns 0 and the top of our mmap region is clamped at 32TB
This also breaks mmap randomisation since the randomised address is
always up near the top of the address space and it gets clamped down
to 32TB.
Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
__get_user_pages_fast() may be called with interrupts disabled (see e.g.
get_futex_key() in kernel/futex.c) and therefore should use local_irq_save()
and local_irq_restore() instead of local_irq_disable()/enable().
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: <stable@vger.kernel.org> [v3.12]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages". Simplify the code, no functional change.
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The callers of free_pgd_range() and hugetlb_free_pgd_range() don't hold
page table locks. The comments seems to be obsolete, so let's remove
them.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use __free_reserved_page() to simplify the code in arch.
It used split_page() in consistent_alloc()/__dma_alloc_coherent()/dma_alloc_coherent(),
so page->_count == 1, and we can free it safely.
__free_reserved_page()
ClearPageReserved()
init_page_count() // it won't change the value
__free_page()
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simple fixes for sparse warnings in this file.
Resolves:
arch/powerpc/mm/numa.c:198:24:
warning: Using plain integer as NULL pointer
arch/powerpc/mm/numa.c:1157:5:
warning: symbol 'hot_add_node_scn_to_nid' was not declared.
Should it be static?
arch/powerpc/mm/numa.c:1238:28:
warning: Using plain integer as NULL pointer
arch/powerpc/mm/numa.c:1538:6:
warning: symbol 'topology_schedule_update' was not declared.
Should it be static?
Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Activating CONFIG_PIN_TLB allows access to the 24 first Mbytes of
memory at bootup instead of 8. It is needed for "big" kernels for
instance when activating CONFIG_LOCKDEP_SUPPORT. This needs to be
taken into account in init_32 too, otherwise memory allocation soon
fails after startup.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Acked-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
Signed-off-by: Scott Wood <scottwood@freescale.com>
The commit e0908085fc ("powerpc/8xx: Fix
regression introduced by cache coherency rewrite") is not needed
anymore. The issue was because dcbst wrongly sets the store bit when
causing a DTLB error, but this is now fixed by commit
0a2ab51ffb ("powerpc/8xx: Fixup DAR from
buggy dcbX instructions.") which handles the buggy dcbx instructions on
data page faults on the 8xx.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[scottwood@freescale.com: fix commit message]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Topic branch for commits that the KVM tree might want to pull
in separately.
Hand merged a few files due to conflicts with the LE stuff
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).
To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.
This adds an API function realmode_pfn_to_page() to get page struct when
MMU is off.
This adds to MM a new function put_page_unless_one() which drops a page
if counter is bigger than 1. It is going to be used when MMU is off
(for example, real mode on PPC64) and we want to make sure that page
release will not happen in real mode as it may crash the kernel in
a horrible way.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
Cc: linux-mm@kvack.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Previous commit 46723bfa540... introduced a new config option
HAVE_BOOTMEM_INFO_NODE that ended up breaking memory hot-remove for ppc
when sparse vmemmap is not defined.
This patch defines HAVE_BOOTMEM_INFO_NODE for ppc and adds the call to
register_page_bootmem_info_node. Without this we get a BUG_ON for memory
hot remove in put_page_bootmem().
This also adds a stub for register_page_bootmem_memmap to allow ppc to build
with sparse vmemmap defined. Leaving this as a stub is fine since the same
vmemmap addresses are also handled in vmemmap_populate and as such are
properly mapped.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.9+]
Unlike global OOM handling, memory cgroup code will invoke the OOM killer
in any OOM situation because it has no way of telling faults occuring in
kernel context - which could be handled more gracefully - from
user-triggered faults.
Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: azurIt <azurit@pobox.sk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently hugepage migration works well only for pmd-based hugepages
(mainly due to lack of testing,) so we had better not enable migration of
other levels of hugepages until we are ready for it.
Some users of hugepage migration (mbind, move_pages, and migrate_pages) do
page table walk and check pud/pmd_huge() there, so they are safe. But the
other users (softoffline and memory hotremove) don't do this, so without
this patch they can try to migrate unexpected types of hugepages.
To prevent this, we introduce hugepage_migration_support() as an
architecture dependent check of whether hugepage are implemented on a pmd
basis or not. And on some architecture multiple sizes of hugepages are
available, so hugepage_migration_support() also checks hugepage size.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Ben Herrenschmidt:
"Here's the powerpc batch for this merge window. Some of the
highlights are:
- A bunch of endian fixes ! We don't have full LE support yet in that
release but this contains a lot of fixes all over arch/powerpc to
use the proper accessors, call the firmware with the right endian
mode, etc...
- A few updates to our "powernv" platform (non-virtualized, the one
to run KVM on), among other, support for bridging the P8 LPC bus
for UARTs, support and some EEH fixes.
- Some mpc51xx clock API cleanups in preparation for a clock API
overhaul
- A pile of cleanups of our old math emulation code, including better
support for using it to emulate optional FP instructions on
embedded chips that otherwise have a HW FPU.
- Some infrastructure in selftest, for powerpc now, but could be
generalized, initially used by some tests for our perf instruction
counting code.
- A pile of fixes for hotplug on pseries (that was seriously
bitrotting)
- The usual slew of freescale embedded updates, new boards, 64-bit
hiberation support, e6500 core PMU support, etc..."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (146 commits)
powerpc: Correct FSCR bit definitions
powerpc/xmon: Fix printing of set of CPUs in xmon
powerpc/pseries: Move lparcfg.c to platforms/pseries
powerpc/powernv: Return secondary CPUs to firmware on kexec
powerpc/btext: Fix CONFIG_PPC_EARLY_DEBUG_BOOTX on ppc32
powerpc: Cleanup handling of the DSCR bit in the FSCR register
powerpc/pseries: Child nodes are not detached by dlpar_detach_node
powerpc/pseries: Add mising of_node_put in delete_dt_node
powerpc/pseries: Make dlpar_configure_connector parent node aware
powerpc/pseries: Do all node initialization in dlpar_parse_cc_node
powerpc/pseries: Fix parsing of initial node path in update_dt_node
powerpc/pseries: Pack update_props_workarea to map correctly to rtas buffer header
powerpc/pseries: Fix over writing of rtas return code in update_dt_node
powerpc/pseries: Fix creation of loop in device node property list
powerpc: Skip emulating & leave interrupts off for kernel program checks
powerpc: Add more exception trampolines for hypervisor exceptions
powerpc: Fix location and rename exception trampolines
powerpc: Add more trap names to xmon
powerpc/pseries: Add a warning in the case of cross-cpu VPA registration
powerpc: Update the 00-Index in Documentation/powerpc
...
Pull trivial tree from Jiri Kosina:
"The usual trivial updates all over the tree -- mostly typo fixes and
documentation updates"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
doc: Documentation/cputopology.txt fix typo
treewide: Convert retrun typos to return
Fix comment typo for init_cma_reserved_pageblock
Documentation/trace: Correcting and extending tracepoint documentation
mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
power: Documentation: Update s2ram link
doc: fix a typo in Documentation/00-INDEX
Documentation/printk-formats.txt: No casts needed for u64/s64
doc: Fix typo "is is" in Documentations
treewide: Fix printks with 0x%#
zram: doc fixes
Documentation/kmemcheck: update kmemcheck documentation
doc: documentation/hwspinlock.txt fix typo
PM / Hibernate: add section for resume options
doc: filesystems : Fix typo in Documentations/filesystems
scsi/megaraid fixed several typos in comments
ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
page_isolation: Fix a comment typo in test_pages_isolated()
doc: fix a typo about irq affinity
...
Memory I/O resources need to be marked as busy or else we cannot remove
them when doing memory hot remove.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The lppaca, slb_shadow and dtl_entry hypervisor structures are
big endian, so we have to byte swap them in little endian builds.
LE KVM hosts will also need to be fixed but for now add an #error
to remind us.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The device tree is big endian so make sure we byteswap on little
endian. We assume any pHyp calls also return big endian results in
memory.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Other architectures have a __get_user_pages_fast(), in addition to the
regular get_user_pages_fast(), which doesn't call get_user_pages() on
failure, and thus doesn't attempt to fault pages in or COW them. The
generic KVM code uses __get_user_pages_fast() to detect whether a page
for which we have only requested read access is actually writable.
This provides an implementation of __get_user_pages_fast() by
splitting the existing get_user_pages_fast() in two. With this, the
generic KVM code will get the right answer instead of always
considering such pages non-writable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Although the shared_proc field in the lppaca works today, it is
not architected. A shared processor partition will always have a non
zero yield_count so use that instead. Create a wrapper so users
don't have to know about the details.
In order for older kernels to continue to work on KVM we need
to set the shared_proc bit. While here, remove the ugly bitfield.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Address some of the trivial sparse warnings in arch/powerpc.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When an associativity level change is found for one thread, the
siblings threads need to be updated as well. This is done today
for PRRN in stage_topology_update() but is missing for VPHN in
update_cpu_associativity_changes_mask(). This patch will correctly
update all thread siblings during a topology change.
Without this patch a topology update can result in a CPU in
init_sched_groups_power() getting stuck indefinitely in a loop.
This loop is built in build_sched_groups(). As a result of the thread
moving to a node separate from its siblings the struct sched_group will
have its next pointer set to point to itself rather than the sched_group
struct of the next thread. This happens because we have a domain without
the SD_OVERLAP flag, which is correct, and a topology that doesn't conform
with reality (threads on the same core assigned to different numa nodes).
When this list is traversed by init_sched_groups_power() it will reach
the thread's sched_group structure and loop indefinitely; the cpu will
be stuck at this point.
The bug was exposed when VPHN was enabled in commit b7abef0 (v3.9).
Cc: <stable@vger.kernel.org> [v3.9+]
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The sllp value is stored in mmu_psize_defs in such a way that we can easily OR
the value to get the operand for slbmte instruction. ie, the L and LP bits are
not contiguous. Decode the bits and use them correctly in tlbie.
regression is introduced by 1f6aaaccb1
"powerpc: Update tlbie/tlbiel as per ISA doc"
Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We should not fallthrough different case statements in hpte_decode. Add
break statement to break out of the switch. The regression is introduced by
dcda287a9b "powerpc/mm: Simplify hpte_decode"
Reported-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Ben Herrenschmidt:
"This is the powerpc changes for the 3.11 merge window. In addition to
the usual bug fixes and small updates, the main highlights are:
- Support for transparent huge pages by Aneesh Kumar for 64-bit
server processors. This allows the use of 16M pages as transparent
huge pages on kernels compiled with a 64K base page size.
- Base VFIO support for KVM on power by Alexey Kardashevskiy
- Wiring up of our nvram to the pstore infrastructure, including
putting compressed oopses in there by Aruna Balakrishnaiah
- Move, rework and improve our "EEH" (basically PCI error handling
and recovery) infrastructure. It is no longer specific to pseries
but is now usable by the new "powernv" platform as well (no
hypervisor) by Gavin Shan.
- I fixed some bugs in our math-emu instruction decoding and made it
usable to emulate some optional FP instructions on processors with
hard FP that lack them (such as fsqrt on Freescale embedded
processors).
- Support for Power8 "Event Based Branch" facility by Michael
Ellerman. This facility allows what is basically "userspace
interrupts" for performance monitor events.
- A bunch of Transactional Memory vs. Signals bug fixes and HW
breakpoint/watchpoint fixes by Michael Neuling.
And more ... I appologize in advance if I've failed to highlight
something that somebody deemed worth it."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
pstore: Add hsize argument in write_buf call of pstore_ftrace_call
powerpc/fsl: add MPIC timer wakeup support
powerpc/mpic: create mpic subsystem object
powerpc/mpic: add global timer support
powerpc/mpic: add irq_set_wake support
powerpc/85xx: enable coreint for all the 64bit boards
powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
powerpc/mpic: Add get_version API both for internal and external use
powerpc: Handle both new style and old style reserve maps
powerpc/hw_brk: Fix off by one error when validating DAWR region end
powerpc/pseries: Support compression of oops text via pstore
powerpc/pseries: Re-organise the oops compression code
pstore: Pass header size in the pstore write callback
powerpc/powernv: Fix iommu initialization again
powerpc/pseries: Inform the hypervisor we are using EBB regs
powerpc/perf: Add power8 EBB support
powerpc/perf: Core EBB support for 64-bit book3s
powerpc/perf: Drop MMCRA from thread_struct
powerpc/perf: Don't enable if we have zero events
...
Prepare for killing free_all_bootmem_node() by using free_all_bootmem().
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prepare for removing num_physpages and simplify mem_init().
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Concentrate code to modify totalram_pages into the mm core, so the arch
memory initialized code doesn't need to take care of it. With these
changes applied, only following functions from mm core modify global
variable totalram_pages: free_bootmem_late(), free_all_bootmem(),
free_all_bootmem_node(), adjust_managed_page_count().
With this patch applied, it will be much more easier for us to keep
totalram_pages and zone->managed_pages in consistence.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Address more review comments from last round of code review.
1) Enhance free_reserved_area() to support poisoning freed memory with
pattern '0'. This could be used to get rid of poison_init_mem()
on ARM64.
2) A previous patch has disabled memory poison for initmem on s390
by mistake, so restore to the original behavior.
3) Remove redundant PAGE_ALIGN() when calling free_reserved_area().
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change signature of free_reserved_area() according to Russell King's
suggestion to fix following build warnings:
arch/arm/mm/init.c: In function 'mem_init':
arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
^
In file included from include/linux/mman.h:4:0,
from arch/arm/mm/init.c:15:
include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
extern unsigned long free_reserved_area(unsigned long start, unsigned long end,
mm/page_alloc.c: In function 'free_reserved_area':
>> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
In file included from arch/mips/include/asm/page.h:49:0,
from include/linux/mmzone.h:20,
from include/linux/gfp.h:4,
from include/linux/mm.h:8,
from mm/page_alloc.c:18:
arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
mm/page_alloc.c: In function 'free_area_init_nodes':
mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]
Also address some minor code review comments.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the already existing interface huge_page_shift instead of h->order +
PAGE_SHIFT.
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJR0K2gAAoJEHm+PkMAQRiGWsEH+gMZSN1qRm34hZ82q1Tx7HvL
Eb/Gsl3Qw/7G2TlTqgjBUs36IdqV9O2cui/aa3/TfXvdvrx+0GlhRkEwQPc+ygcO
Mvoyoke4tT4+4jVFdCg1J8avREsa28/6oaHs0ZZxuVmJBBLTJH7aXaNsGn6eU1q9
9+p798MQis6naIiPC63somlZcCIiBhsuWCPWpEfLMn8G1HWAFTM3xXIbNBqe/brS
bmIOfhomlIZ5dcdaXGvjtP3+KJhkNDwhkPC4tVYu8JqqgSlrE+a+EGyEuuGqKk10
U+swiqyuD31uBI9ga54u/2FzSqDiAu6YOcMXevjo/m3g9XLdYbYLvN+nvN8alCQ=
=Ob6Z
-----END PGP SIGNATURE-----
Merge tag 'v3.10' into next
Merge 3.10 in order to get some of the last minute powerpc
changes, resolve conflicts and add additional fixes on top
of them.
The topology update code that updates the cpu node registration in sysfs
should not be called while in stop_machine(). The register/unregister
calls take a lock and may sleep.
This patch moves these calls outside of the call to stop_machine().
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications. For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.
After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out. Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.
This removes all the powerpc uses of the __cpuinit macros. There
are no __CPUINIT users in assembly files in powerpc.
[1] https://lkml.org/lkml/2013/5/20/589
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Hugepage invalidate involves invalidating multiple hpte entries.
Optimize the operation using H_BULK_REMOVE on lpar platforms.
On native, reduce the number of tlb flush.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We enable only if the we support 16MB page size.
Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We find all the overlapping vma and mark them such that we don't allocate
hugepage in that range. Also we split existing huge page so that the
normal page hash can be invalidated and new page faulted in with new
protection bits.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With THP we set pmd to none, before we do pte_clear. Hence we can't
walk page table to get the pte lock ptr and verify whether it is locked.
THP do take pte lock before calling pte_clear. So we don't change the locking
rules here. It is that we can't use page table walking to check whether
pte locks are held with THP.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
GCC is very likely to read the pagetables just once and cache them in
the local stack or in a register, but it is can also decide to re-read
the pagetables. The problem is that the pagetable in those places can
change from under gcc.
With THP/hugetlbfs the pmd (and pgd for hugetlbfs giga pages) can
change under gup_fast. The pages won't be freed untill we finish
gup fast because we have irq disabled and we free these pages via
rcu callback.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We need to have irqs disabled to handle all the possible parallel update for
linux page table without holding locks.
Events that we are intersted in while walking page tables are
1) Page fault
2) umap
3) THP split
4) THP collapse
A) local_irq_disabled:
------------------------
1) page fault:
A none to valid transition via page fault is not an issue because we
would either see a none or valid. If it is none, we would error out
the page table walk. We may need to use on stack values when checking for
type of page table elements, because if we do
if (!is_hugepd()) {
if (!pmd_none() {
if (pmd_bad() {
We could take that bad condition because the pmd got converted to a hugepd
after the !is_hugepd check via a hugetlb fault.
The right way would be to check for pmd_none higher up or use on stack value.
2) A valid to none conversion via unmap:
We can safely walk the upper level table, because we don't remove the the
page table entries until rcu grace period. So even if we followed a
wrong pointer we still have the pointer valid till the grace period.
A PTE pointer returned need to be atomically checked for _PAGE_PRESENT and
_PAGE_BUSY. A valid pointer returned could becoming none later. To prevent
pte_clear we take _PAGE_BUSY.
3) THP split:
A valid transparent hugepage is converted to nomal page. Before we split we
do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING
So when walking page table we need to check for pmd_trans_splitting and
handle that. The pte returned should also need to be checked for
_PAGE_SPLITTING before setting _PAGE_BUSY similar to _PAGE_PRESENT. We save
the value of PTE on stack and check for the flag in the local pte value.
If we don't have the value set we can safely operate on the local pte value
and we atomicaly set _PAGE_BUSY.
4) THP collapse:
A normal page gets converted to hugepage. In the collapse path, we
mark the pmd none early (pmdp_clear_flush). With irq disabled, if we
are aleady walking page table we would see the pmd_none and won't continue.
If we see a valid PMD, we should still check for _PAGE_PRESENT before
setting _PAGE_BUSY, to make sure we didn't collapse the PTE to a Huge PTE.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The deposted PTE page in the second half of the PMD table is used to
track the state on hash PTEs. After updating the HPTE, we mark the
coresponding slot in the deposted PTE page valid.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
document why we don't need to handle transparent hugepages at callsites.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We will use this in the later patch for handling THP pages
Reviewed-by: David Gibson <dwg@au1.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We now have pmd entries covering 16MB range and the PMD table double its original size.
We use the second half of the PMD table to deposit the pgtable (PTE page).
The depoisted PTE page is further used to track the HPTE information. The information
include [ secondary group | 3 bit hidx | valid ]. We use one byte per each HPTE entry.
With 16MB hugepage and 64K HPTE we need 256 entries and with 4K HPTE we need
4096 entries. Both will fit in a 4K PTE page. On hugepage invalidate we need to walk
the PTE page and invalidate all valid HPTEs.
This patch implements necessary arch specific functions for THP support and also
hugepage invalidate logic. These PMD related functions are intentionally kept
similar to their PTE counter-part.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
THP code does PTE page allocation along with large page request and deposit them
for later use. This is to ensure that we won't have any failures when we split
hugepages to regular pages.
On powerpc we want to use the deposited PTE page for storing hash pte slot and
secondary bit information for the HPTEs. We use the second half
of the pmd table to save the deposted PTE page.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If a hash bucket gets full, we "evict" a more/less random entry from it.
When we do that we don't invalidate the TLB (hpte_remove) because we assume
the old translation is still technically "valid". This implies that when
we are invalidating or updating pte, even if HPTE entry is not valid
we should do a tlb invalidate. With hugepages, we need to pass the correct
actual page size value for tlb invalidation.
This change update the patch 0608d69246
"powerpc/mm: Always invalidate tlb on hpte invalidate and update" to handle
transparent hugepages correctly.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This happens with threads that are offline due to CPU hotplug
(including threads that were never "plugged in" to begin with because
SMT is disabled).
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Book3E uses the hugepd at PMD level and don't encode pte directly
at the pmd level. So it will find the lower bits of pmd set
and the pmd_bad check throws error. Infact the current code
will never take the free_hugepd_range call at all because it will
clear the pmd if it find a hugepd pointer. Fix this by clearing
bad pmd only if it is not a hugepd pointer.
This is regression introduced by e2b3d202d1
"powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format"
Reported-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Ever since commit 45f035ab9b ("CONFIG_HOTPLUG should be always on"),
it has been basically impossible to build a kernel with CONFIG_HOTPLUG
turned off. Remove all the remaining references to it.
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If a hash bucket gets full, we "evict" a more/less random entry from it.
When we do that we don't invalidate the TLB (hpte_remove) because we assume
the old translation is still technically "valid". This implies that when
we are invalidating or updating pte, even if HPTE entry is not valid
we should do a tlb invalidate.
This was a regression introduced by b1022fbd29
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is the exception hooks for context tracking subsystem, including
data access, program check, single step, instruction breakpoint, machine check,
alignment, fp unavailable, altivec assist, unknown exception, whose handlers
might use RCU.
This patch corresponds to
[PATCH] x86: Exception hooks for userspace RCU extended QS
commit 6ba3c97a38
But after the exception handling moved to generic code, and some changes in
following two commits:
56dd9470d7
context_tracking: Move exception handling to generic code
6c1e0256fa
context_tracking: Restore correct previous context state on exception exit
it is able for exception hooks to use the generic code above instead of a
redundant arch implementation.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Make sure that current->thread.reg exists before we deference it in
flush_hash_page.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reported-by: John J Miller <millerjo@us.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull powerpc update from Benjamin Herrenschmidt:
"The main highlights this time around are:
- A pile of addition POWER8 bits and nits, such as updated
performance counter support (Michael Ellerman), new branch history
buffer support (Anshuman Khandual), base support for the new PCI
host bridge when not using the hypervisor (Gavin Shan) and other
random related bits and fixes from various contributors.
- Some rework of our page table format by Aneesh Kumar which fixes a
thing or two and paves the way for THP support. THP itself will
not make it this time around however.
- More Freescale updates, including Altivec support on the new e6500
cores, new PCI controller support, and a pile of new boards support
and updates.
- The usual batch of trivial cleanups & fixes"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
powerpc: Fix build error for book3e
powerpc: Context switch the new EBB SPRs
powerpc: Turn on the EBB H/FSCR bits
powerpc: Replace CPU_FTR_BCTAR with CPU_FTR_ARCH_207S
powerpc: Setup BHRB instructions facility in HFSCR for POWER8
powerpc: Fix interrupt range check on debug exception
powerpc: Update tlbie/tlbiel as per ISA doc
powerpc: Print page size info during boot
powerpc: print both base and actual page size on hash failure
powerpc: Fix hpte_decode to use the correct decoding for page sizes
powerpc: Decode the pte-lp-encoding bits correctly.
powerpc: Use encode avpn where we need only avpn values
powerpc: Reduce PTE table memory wastage
powerpc: Move the pte free routines from common header
powerpc: Reduce the PTE_INDEX_SIZE
powerpc: Switch 16GB and 16MB explicit hugepages to a different page table format
powerpc: New hugepage directory format
powerpc: Don't truncate pgd_index wrongly
powerpc: Don't hard code the size of pte page
powerpc: Save DAR and DSISR in pt_regs on MCE
...
Encode the actual page correctly in tlbie/tlbiel. This make sure we handle
multiple page size segment correctly.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This gives hint about different base and actual page size combination
supported by the platform.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As per ISA doc, we encode base and actual page size in the LP bits of
PTE. The number of bit used to encode the page sizes depend on actual
page size. ISA doc lists this as
PTE LP actual page size
rrrr rrrz >=8KB
rrrr rrzz >=16KB
rrrr rzzz >=32KB
rrrr zzzz >=64KB
rrrz zzzz >=128KB
rrzz zzzz >=256KB
rzzz zzzz >=512KB
zzzz zzzz >=1MB
ISA doc also says
"The values of the “z” bits used to specify each size, along with all possible
values of “r” bits in the LP field, must result in LP values distinct from
other LP values for other sizes."
based on the above update hpte_decode to use the correct decoding for LP bits.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We look at both the segment base page size and actual page size and store
the pte-lp-encodings in an array per base page size.
We also update all relevant functions to take actual page size argument
so that we can use the correct PTE LP encoding in HPTE. This should also
get the basic Multiple Page Size per Segment (MPSS) support. This is needed
to enable THP on ppc64.
[Fixed PR KVM build --BenH]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In all these cases we are doing something similar to
HPTE_V_COMPARE(hpte_v, want_v) which ignores the HPTE_V_LARGE bit
With MPSS support we would need actual page size to set HPTE_V_LARGE
bit and that won't be available in most of these cases. Since we are ignoring
HPTE_V_LARGE bit, use the avpn value instead. There should not be any change
in behaviour after this patch.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We allocate one page for the last level of linux page table. With THP and
large page size of 16MB, that would mean we are wasting large part
of that page. To map 16MB area, we only need a PTE space of 2K with 64K
page size. This patch reduce the space wastage by sharing the page
allocated for the last level of linux page table with multiple pmd
entries. We call these smaller chunks PTE page fragments and allocated
page, PTE page.
In order to support systems which doesn't have 64K HPTE support, we also
add another 2K to PTE page fragment. The second half of the PTE fragments
is used for storing slot and secondary bit information of an HPTE. With this
we now have a 4K PTE fragment.
We use a simple approach to share the PTE page. On allocation, we bump the
PTE page refcount to 16 and share the PTE page with the next 16 pte alloc
request. This should help in the node locality of the PTE page fragment,
assuming that the immediate pte alloc request will mostly come from the
same NUMA node. We don't try to reuse the freed PTE page fragment. Hence
we could be waisting some space.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We will be switching PMD_SHIFT to 24 bits to facilitate THP impmenetation.
With PMD_SHIFT set to 24, we now have 16MB huge pages allocated at PGD level.
That means with 32 bit process we cannot allocate normal pages at
all, because we cover the entire address space with one pgd entry. Fix this
by switching to a new page table format for hugepages. With the new page table
format for 16GB and 16MB hugepages we won't allocate hugepage directory. Instead
we encode the PTE information directly at the directory level. This forces 16MB
hugepage at PMD level. This will also make the page take walk much simpler later
when we add the THP support.
With the new table format we have 4 cases for pgds and pmds:
(1) invalid (all zeroes)
(2) pointer to next table, as normal; bottom 6 bits == 0
(3) leaf pte for huge page, bottom two bits != 00
(4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Change the hugepage directory format so that we can have leaf ptes directly
at page directory avoiding the allocation of hugepage directory.
With the new table format we have 3 cases for pgds and pmds:
(1) invalid (all zeroes)
(2) pointer to next table, as normal; bottom 6 bits == 0
(4) hugepd pointer, bottom two bits == 00, next 4 bits indicate size of table
Instead of storing shift value in hugepd pointer we use mmu_psize_def index
so that we can fit all the supported hugepage size in 4 bits
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
USE PTRS_PER_PTE to indicate the size of pte page. To support THP,
later patches will be changing PTRS_PER_PTE value.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Correct build failure for powerpc/pseries builds with CONFIG_SMP not defined.
The function cpu_sibling_mask has no meaning (or definition) when CONFIG_SMP
is not defined. Additionally, the updating of NUMA affinity for a CPU in a UP
system doesn't really make sense.
This patch ifdef's out the code making the affinity updates for PRRN events to
fix the following build break.
arch/powerpc/mm/numa.c: In function ‘stage_topology_update’:
arch/powerpc/mm/numa.c:1535: error: implicit declaration of function ‘cpu_sibling_mask’
arch/powerpc/mm/numa.c:1535: warning: passing argument 3 of ‘cpumask_or’ makes pointer from integer without a cast
make[1]: *** [arch/powerpc/mm/numa.o] Error 1
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
After merging the cgroup tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:
arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]
Caused by commit 30c05350c3 ("powerpc/pseries: Use stop machine to
update cpu maps") from the powerpc tree interacting with (probably)
commit ff794dea52 ("cpuset: remove include of cgroup.h from cpuset.h")
from the cgroup tree. Removing includes from header files is fraught
with danger ...
The former should have added an include of linux/slab.h to
arch/powerpc/mm/numa.c.
I have added the following merge fix patch for today (but it should be
applied to the powerpc tree ASAP).
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 29 Apr 2013 14:01:44 +1000
Subject: [PATCH] powerpc: numa.c: using kzalloc/kfree requires including
slab.h
fixes these build errors:
arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull cgroup updates from Tejun Heo:
- Fixes and a lot of cleanups. Locking cleanup is finally complete.
cgroup_mutex is no longer exposed to individual controlelrs which
used to cause nasty deadlock issues. Li fixed and cleaned up quite a
bit including long standing ones like racy cgroup_path().
- device cgroup now supports proper hierarchy thanks to Aristeu.
- perf_event cgroup now supports proper hierarchy.
- A new mount option "__DEVEL__sane_behavior" is added. As indicated
by the name, this option is to be used for development only at this
point and generates a warning message when used. Unfortunately,
cgroup interface currently has too many brekages and inconsistencies
to implement a consistent and unified hierarchy on top. The new flag
is used to collect the behavior changes which are necessary to
implement consistent unified hierarchy. It's likely that this flag
won't be used verbatim when it becomes ready but will be enabled
implicitly along with unified hierarchy.
The option currently disables some of broken behaviors in cgroup core
and also .use_hierarchy switch in memcg (will be routed through -mm),
which can be used to make very unusual hierarchy where nesting is
partially honored. It will also be used to implement hierarchy
support for blk-throttle which would be impossible otherwise without
introducing a full separate set of control knobs.
This is essentially versioning of interface which isn't very nice but
at this point I can't see any other options which would allow keeping
the interface the same while moving towards hierarchy behavior which
is at least somewhat sane. The planned unified hierarchy is likely
to require some level of adaptation from userland anyway, so I think
it'd be best to take the chance and update the interface such that
it's supportable in the long term.
Maintaining the existing interface does complicate cgroup core but
shouldn't put too much strain on individual controllers and I think
it'd be manageable for the foreseeable future. Maybe we'll be able
to drop it in a decade.
Fix up conflicts (including a semantic one adding a new #include to ppc
that was uncovered by header the file changes) as per Tejun.
* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
cpuset: fix compile warning when CONFIG_SMP=n
cpuset: fix cpu hotplug vs rebuild_sched_domains() race
cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
cgroup: restore the call to eventfd->poll()
cgroup: fix use-after-free when umounting cgroupfs
cgroup: fix broken file xattrs
devcg: remove parent_cgroup.
memcg: force use_hierarchy if sane_behavior
cgroup: remove cgrp->top_cgroup
cgroup: introduce sane_behavior mount option
move cgroupfs_root to include/linux/cgroup.h
cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
cgroup: make cgroup_path() not print double slashes
Revert "cgroup: remove bind() method from cgroup_subsys."
perf: make perf_event cgroup hierarchical
cgroup: implement cgroup_is_descendant()
cgroup: make sure parent won't be destroyed before its children
cgroup: remove bind() method from cgroup_subsys.
devcg: remove broken_hierarchy tag
cgroup: remove cgroup_lock_is_held()
...
From Kumar Gala:
<<
Add support for T4 and B4 SoC families from Freescale, e6500 altivec
support, some various board fixes and other minor cleanups.
>>
Update the powerpc slice_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.
Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Tested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As all other architectures have been converted to use vm_unmapped_area(),
we are about to retire the free_area_cache.
This change simply removes the use of that cache in
slice_get_unmapped_area(), which will most certainly have a
performance cost. Next one will convert that function to use the
vm_unmapped_area() infrastructure and regain the performance.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The sparse code, when asking the architecture to populate the vmemmap,
specifies the section range as a starting page and a number of pages.
This is an awkward interface, because none of the arch-specific code
actually thinks of the range in terms of 'struct page' units and always
translates it to bytes first.
In addition, later patches mix huge page and regular page backing for
the vmemmap. For this, they need to call vmemmap_populate_basepages()
on sub-section ranges with PAGE_SIZE and PMD_SIZE in mind. But these
are not necessarily multiples of the 'struct page' size and so this unit
is too coarse.
Just translate the section range into bytes once in the generic sparse
code, then pass byte ranges down the stack.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Bernhard Schmidt <Bernhard.Schmidt@lrz.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David S. Miller <davem@davemloft.net>
Tested-by: David S. Miller <davem@davemloft.net>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use helper function free_highmem_page() to free highmem pages into
the buddy system.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: "Suzuki K. Poulose" <suzuki@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anatolij Gustschin <agust@denx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are instances in which we do not want topology updates to occur.
In order to allow this a /proc interface (/proc/powerpc/topology_updates)
is introduced so that topology updates can be enabled and disabled.
This patch also adds a prrn_is_enabled() call so that PRRN events are
handled in the kernel only if topology updating is enabled.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The new PRRN firmware feature provides a more convenient and event-driven
interface than VPHN for notifying Linux of changes to the NUMA affinity of
platform resources. However, for practical reasons, it may not be feasible
for some customers to update to the latest firmware. For these customers,
the VPHN feature supported on previous firmware versions may still be the
best option.
The VPHN feature was previously disabled due to races with the load
balancing code when accessing the NUMA cpu maps, but the new stop_machine()
approach protects the NUMA cpu maps from these concurrent accesses. It
should be safe to re-enable this feature now.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The following patch adds vdso_getcpu_init(), which stores the NUMA node for
a cpu in SPRG3:
Commit 18ad51dd34 ("powerpc: Add VDSO version of getcpu") adds
vdso_getcpu_init(), which stores the NUMA node for a cpu in SPRG3.
This patch ensures that this information is also updated when the NUMA
affinity of a cpu changes.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The new PRRN firmware feature allows CPU and memory resources to be
transparently reassigned across NUMA boundaries. When this happens, the
kernel must update the node maps to reflect the new affinity information.
Although the NUMA maps can be protected by locking primitives during the
update itself, this is insufficient to prevent concurrent accesses to these
structures. Since cpumask_of_node() hands out a pointer to these
structures, they can still be modified outside of the lock. Furthermore,
tracking down each usage of these pointers and adding locks would be quite
invasive and difficult to maintain.
The approach used is to make a list of affected cpus and call stop_machine
to have the update routine run on each of the affected cpus allowing them
to update themselves. Each cpu finds itself in the list of cpus and makes
the appropriate updates. We need to have each cpu do this for themselves to
handle calls to vdso_getcpu_init() added in a subsequent patch.
Situations like these are best handled using stop_machine(). Since the NUMA
affinity updates are exceptionally rare events, this approach has the
benefit of not adding any overhead while accessing the NUMA maps during
normal operation.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Platform events such as partition migration or the new PRRN firmware
feature can cause the NUMA characteristics of a CPU to change, and these
changes will be reflected in the device tree nodes for the affected
CPUs.
This patch registers a handler for Open Firmware device tree updates
and reconfigures the CPU and node maps whenever the associativity
changes. Currently, this is accomplished by marking the affected CPUs in
the cpu_associativity_changes_mask and allowing
arch_update_cpu_topology() to retrieve the new associativity information
using hcall_vphn().
Protecting the NUMA cpu maps from concurrent access during an update
operation will be addressed in a subsequent patch in this series.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Update the numa code to use the updated firmware_has_feature() when checking
for type 1 affinity.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Move the logic trying to insert hpte in __hash_page_huge() to an helper
function, so it could also be used by others.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
It seems that new_pte and rflags don't get changed in the repeating loop, so
move their assignment out of the loop.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
This function has always been marked as __cpuinit, but is only called
from functions marked as __init and references an __initdata variable.
So change its annotation to __init.
Fixes this build warning:
WARNING: arch/powerpc/mm/built-in.o(.cpuinit.text+0x86): Section mismatch in reference from the function .fake_numa_create_new_node() to the variable .init.data:cmdline
The function __cpuinit .fake_numa_create_new_node() references
a variable __initdata cmdline.
If cmdline is only used by .fake_numa_create_new_node then
annotate cmdline with a matching annotation.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
The following commit breaks numa distance setup for old powerpc
systems that use form0 encoding in device tree.
commit 41eab6f88f
powerpc/numa: Use form 1 affinity to setup node distance
Device tree node /rtas/ibm,associativity-reference-points would
index into /cpus/PowerPCxxxx/ibm,associativity based on form0 or
form1 encoding detected by ibm,architecture-vec-5 property.
All modern systems use form1 and current kernel code is correct.
However, on older systems with form0 encoding, the numa distance
will get hard coded as LOCAL_DISTANCE for all nodes. This causes
task scheduling anomaly since scheduler will skip building numa
level domain (topmost domain with all cpus) if all numa distances
are same. (value of 'level' in sched_init_numa() will remain 0)
Prior to the above commit:
((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
Restoring compatible behavior with this patch for old powerpc systems
with device tree where numa distance are encoded as form0.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Untested. As this typo was introduced in v3.3, with commit
9d67028090 ("powerpc: Split ICSWX ACOP and
PID processing"), which actually added PPC_ICSWX_PID, this surely needs
testing.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
This fixes the following checkpatch.pl warnings:
WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(kmap_prot);
WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(kmap_pte);
Signed-off-by: Valentina Manea <valentina.manea.m@gmail.com>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Now we use ESID_BITS of kernel address to build proto vsid. So rename
USER_ESIT_BITS to ESID_BITS
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.8]
This patch change the kernel VSID range so that we limit VSID_BITS to 37.
This enables us to support 64TB with 65 bit VA (37+28). Without this patch
we have boot hangs on platforms that only support 65 bit VA.
With this patch we now have proto vsid generated as below:
We first generate a 37-bit "proto-VSID". Proto-VSIDs are generated
from mmu context id and effective segment id of the address.
For user processes max context id is limited to ((1ul << 19) - 5)
for kernel space, we use the top 4 context ids to map address as below
0x7fffc - [ 0xc000000000000000 - 0xc0003fffffffffff ]
0x7fffd - [ 0xd000000000000000 - 0xd0003fffffffffff ]
0x7fffe - [ 0xe000000000000000 - 0xe0003fffffffffff ]
0x7ffff - [ 0xf000000000000000 - 0xf0003fffffffffff ]
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.8]
Commit f5339277eb accidentally removed
more than just iSeries bits and took out the call to stab_initialize()
thus breaking support for POWER3 processors.
Put it back. (Yes, nobody noticed until now ...)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.4+]
The e6500 core used on T4240 and B4860 SoCs from FSL implements MMUv2 of
the Power Book-E Architecture. However there are some minor differences
between it and other Book-E implementations.
Add support to parse SPRN_TLB1PS for the variable page sizes supported.
In the future this should be expanded for more page sizes supported on
e6500 as well as other MMU features.
This patch is based on code from Scott Wood.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Merge second patch-bomb from Andrew Morton:
- A little DM fix
- the MM queue
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits)
ksm: allocate roots when needed
mm: cleanup "swapcache" in do_swap_page
mm,ksm: swapoff might need to copy
mm,ksm: FOLL_MIGRATION do migration_entry_wait
ksm: shrink 32-bit rmap_item back to 32 bytes
ksm: treat unstable nid like in stable tree
ksm: add some comments
tmpfs: fix mempolicy object leaks
tmpfs: fix use-after-free of mempolicy object
mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
mm: export mmu notifier invalidates
mm: accelerate mm_populate() treatment of THP pages
mm: use long type for page counts in mm_populate() and get_user_pages()
mm: accurately document nr_free_*_pages functions with code comments
HWPOISON: change order of error_states[]'s elements
HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
memcg: stop warning on memcg_propagate_kmem
net: change type of virtio_chan->p9_max_pages
vmscan: change type of vm_total_pages to unsigned long
fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used
...
Introduce a new API vmemmap_free() to free and remove vmemmap
pagetables. Since pagetable implements are different, each architecture
has to provide its own version of vmemmap_free(), just like
vmemmap_populate().
Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.
[mhocko@suse.cz: fix implicit declaration of remove_pagetable]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For removing memmap region of sparse-vmemmap which is allocated bootmem,
memmap region of sparse-vmemmap needs to be registered by
get_page_bootmem(). So the patch searches pages of virtual mapping and
registers the pages by get_page_bootmem().
NOTE: register_page_bootmem_memmap() is not implemented for ia64,
ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
and revert register_page_bootmem_info_node() when platform doesn't
support it.
It's implemented by adding a new Kconfig option named
CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected
by memory-hotplug feature fully supported archs(currently only on
x86_64).
Since we have 2 config options called MEMORY_HOTPLUG and
MEMORY_HOTREMOVE used for memory hot-add and hot-remove separately,
and codes in function register_page_bootmem_info_node() are only
used for collecting infomation for hot-remove, so reside it under
MEMORY_HOTREMOVE.
Besides page_isolation.c selected by MEMORY_ISOLATION under
MEMORY_HOTPLUG is also such case, move it too.
[mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
[linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
[mhocko@suse.cz: remove the arch specific functions without any implementation]
[linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
[rientjes@google.com: fix defined but not used warning]
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wu Jianguo <wujianguo@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For removing memory, we need to remove page tables. But it depends on
architecture. So the patch introduce arch_remove_memory() for removing
page table. Now it only calls __remove_pages().
Note: __remove_pages() for some archtecuture is not implemented
(I don't know how to implement it for s390).
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Benjamin Herrenschmidt:
"So from the depth of frozen Minnesota, here's the powerpc pull request
for 3.9. It has a few interesting highlights, in addition to the
usual bunch of bug fixes, minor updates, embedded device tree updates
and new boards:
- Hand tuned asm implementation of SHA1 (by Paulus & Michael
Ellerman)
- Support for Doorbell interrupts on Power8 (kind of fast
thread-thread IPIs) by Ian Munsie
- Long overdue cleanup of the way we handle relocation of our open
firmware trampoline (prom_init.c) on 64-bit by Anton Blanchard
- Support for saving/restoring & context switching the PPR (Processor
Priority Register) on server processors that support it. This
allows the kernel to preserve thread priorities established by
userspace. By Haren Myneni.
- DAWR (new watchpoint facility) support on Power8 by Michael Neuling
- Ability to change the DSCR (Data Stream Control Register) which
controls cache prefetching on a running process via ptrace by
Alexey Kardashevskiy
- Support for context switching the TAR register on Power8 (new
branch target register meant to be used by some new specific
userspace perf event interrupt facility which is yet to be enabled)
by Ian Munsie.
- Improve preservation of the CFAR register (which captures the
origin of a branch) on various exception conditions by Paulus.
- Move the Bestcomm DMA driver from arch powerpc to drivers/dma where
it belongs by Philippe De Muyter
- Support for Transactional Memory on Power8 by Michael Neuling
(based on original work by Matt Evans). For those curious about
the feature, the patch contains a pretty good description."
(See commit db8ff907027b: "powerpc: Documentation for transactional
memory on powerpc" for the mentioned description added to the file
Documentation/powerpc/transactional_memory.txt)
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (140 commits)
powerpc/kexec: Disable hard IRQ before kexec
powerpc/85xx: l2sram - Add compatible string for BSC9131 platform
powerpc/85xx: bsc9131 - Correct typo in SDHC device node
powerpc/e500/qemu-e500: enable coreint
powerpc/mpic: allow coreint to be determined by MPIC version
powerpc/fsl_pci: Store the pci ctlr device ptr in the pci ctlr struct
powerpc/85xx: Board support for ppa8548
powerpc/fsl: remove extraneous DIU platform functions
arch/powerpc/platforms/85xx/p1022_ds.c: adjust duplicate test
powerpc: Documentation for transactional memory on powerpc
powerpc: Add transactional memory to pseries and ppc64 defconfigs
powerpc: Add config option for transactional memory
powerpc: Add transactional memory to POWER8 cpu features
powerpc: Add new transactional memory state to the signal context
powerpc: Hook in new transactional memory code
powerpc: Routines for FP/VSX/VMX unavailable during a transaction
powerpc: Add transactional memory unavaliable execption handler
powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes
powerpc: Add FP/VSX and VMX register load functions for transactional memory
powerpc: Add helper functions for transactional memory context switching
...
This hooks the new transactional memory code into context switching, FP/VMX/VMX
unavailable and exception return.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The ASM version of hash computation function was truncating the upper bit.
Make the ASM version similar to hpt_hash function. Remove masking vsid bits.
Without this patch, we observed hang during bootup due to not satisfying page
fault request correctly. The fault handler used wrong hash values to update
the HPTE. Hence we kept looping with page fault.
hash_page(ea=000001003e260008, access=203, trap=300 ip=3fff91787134 dsisr 42000000
The computed value of hash 000000000f22f390
update: avpnv=4003e46054003e00, hash=000000000722f390, f=80000006, psize: 2 ...
BenH: The over-masking has been there for ever but only hurts with the
new 64T support introduced in 3.7
Reported-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: <stable@vger.kernel.org> [v3.7]
The only persistent change made by this loop is calling
memblock_set_node() once for each memblock, which is not useful (and has
no effect) as memblock_set_node() is not called with any
memblock-specific parameters.
Subsistute a single memblock_set_node().
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This is a rewrite so that we don't assume we are using the DABR throughout the
code. We now use the arch_hw_breakpoint to store the breakpoint in a generic
manner in the thread_struct, rather than storing the raw DABR value.
The ptrace GET/SET_DEBUGREG interface currently passes the raw DABR in from
userspace. We keep this functionality, so that future changes (like the POWER8
DAWR), will still fake the DABR to userspace.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Finally remove the two level TOC and build with -mcmodel=medium.
Unfortunately we can't build modules with -mcmodel=medium due to
the tricks the kernel module loader plays with percpu data:
# -mcmodel=medium breaks modules because it uses 32bit offsets from
# the TOC pointer to create pointers where possible. Pointers into the
# percpu data area are created by this method.
#
# The kernel module loader relocates the percpu data section from the
# original location (starting with 0xd...) to somewhere in the base
# kernel percpu data space (starting with 0xc...). We need a full
# 64bit relocation for this to work, hence -mcmodel=large.
On older kernels we fall back to the two level TOC (-mminimal-toc)
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.
This change removes the use of __devinit, __devexit_p, __devinitdata,
__devinitconst, and __devexit from these drivers.
Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull powerpc update from Benjamin Herrenschmidt:
"The main highlight is probably some base POWER8 support. There's more
to come such as transactional memory support but that will wait for
the next one.
Overall it's pretty quiet, or rather I've been pretty poor at picking
things up from patchwork and reviewing them this time around and Kumar
no better on the FSL side it seems..."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (73 commits)
powerpc+of: Rename and fix OF reconfig notifier error inject module
powerpc: mpc5200: Add a3m071 board support
powerpc/512x: don't compile any platform DIU code if the DIU is not enabled
powerpc/mpc52xx: use module_platform_driver macro
powerpc+of: Export of_reconfig_notifier_[register,unregister]
powerpc/dma/raidengine: add raidengine device
powerpc/iommu/fsl: Add PAMU bypass enable register to ccsr_guts struct
powerpc/mpc85xx: Change spin table to cached memory
powerpc/fsl-pci: Add PCI controller ATMU PM support
powerpc/86xx: fsl_pcibios_fixup_bus requires CONFIG_PCI
drivers/virt: the Freescale hypervisor driver doesn't need to check MSR[GS]
powerpc/85xx: p1022ds: Use NULL instead of 0 for pointers
powerpc: Disable relocation on exceptions when kexecing
powerpc: Enable relocation on during exceptions at boot
powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate function
powerpc: Add wrappers to enable/disable relocation on exceptions
powerpc: Add set_mode hcall
powerpc: Setup relocation on exceptions for bare metal systems
powerpc: Move initial mfspr LPCR out of __init_LPCR
powerpc: Add relocation on exception vector handlers
...
Merge misc VM changes from Andrew Morton:
"The rest of most-of-MM. The other MM bits await a slab merge.
This patch includes the addition of a huge zero_page. Not a
performance boost but it an save large amounts of physical memory in
some situations.
Also a bunch of Fujitsu engineers are working on memory hotplug.
Which, as it turns out, was badly broken. About half of their patches
are included here; the remainder are 3.8 material."
However, this merge disables CONFIG_MOVABLE_NODE, which was totally
broken. We don't add new features with "default y", nor do we add
Kconfig questions that are incomprehensible to most people without any
help text. Does the feature even make sense without compaction or
memory hotplug?
* akpm: (54 commits)
mm/bootmem.c: remove unused wrapper function reserve_bootmem_generic()
mm/memory.c: remove unused code from do_wp_page()
asm-generic, mm: pgtable: consolidate zero page helpers
mm/hugetlb.c: fix warning on freeing hwpoisoned hugepage
hwpoison, hugetlbfs: fix RSS-counter warning
hwpoison, hugetlbfs: fix "bad pmd" warning in unmapping hwpoisoned hugepage
mm: protect against concurrent vma expansion
memcg: do not check for mm in __mem_cgroup_count_vm_event
tmpfs: support SEEK_DATA and SEEK_HOLE (reprise)
mm: provide more accurate estimation of pages occupied by memmap
fs/buffer.c: remove redundant initialization in alloc_page_buffers()
fs/buffer.c: do not inline exported function
writeback: fix a typo in comment
mm: introduce new field "managed_pages" to struct zone
mm, oom: remove statically defined arch functions of same name
mm, oom: remove redundant sleep in pagefault oom handler
mm, oom: cleanup pagefault oom handler
memory_hotplug: allow online/offline memory to result movable node
numa: add CONFIG_MOVABLE_NODE for movable-dedicated node
mm, memcg: avoid unnecessary function call when memcg is disabled
...
out_of_memory() is a globally defined function to call the oom killer.
x86, sh, and powerpc all use a function of the same name within file scope
in their respective fault.c unnecessarily. Inline the functions into the
pagefault handlers to clean the code up.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"Whether" is misspelled in various comments across the tree; this
fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Don't use 47x only #defines for TLBIVAX or ICBT, supply and use helpers
in ppc-opcode.h
This fixes a compile breakage.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch moves the definition of the of_drconf_cell struct to asm/prom.h
to make it available for all powerpc/pseries code.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
.fault now can retry. The retry can break state machine of .fault. In
filemap_fault, if page is miss, ra->mmap_miss is increased. In the second
try, since the page is in page cache now, ra->mmap_miss is decreased. And
these are done in one fault, so we can't detect random mmap file access.
Add a new flag to indicate .fault is tried once. In the second try, skip
ra->mmap_miss decreasing. The filemap_fault state machine is ok with it.
I only tested x86, didn't test other archs, but looks the change for other
archs is obvious, but who knows :)
Signed-off-by: Shaohua Li <shaohua.li@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull powerpc updates from Benjamin Herrenschmidt:
"Some highlights in addition to the usual batch of fixes:
- 64TB address space support for 64-bit processes by Aneesh Kumar
- Gavin Shan did a major cleanup & re-organization of our EEH support
code (IBM fancy PCI error handling & recovery infrastructure) which
paves the way for supporting different platform backends, along
with some rework of the PCIe code for the PowerNV platform in order
to remove home made resource allocations and instead use the
generic code (which is possible after some small improvements to it
done by Gavin).
- Uprobes support by Ananth N Mavinakayanahalli
- A pile of embedded updates from Freescale folks, including new SoC
and board supports, more KVM stuff including preparing for 64-bit
BookE KVM support, ePAPR 1.1 updates, etc..."
Fixup trivial conflicts in drivers/scsi/ipr.c
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (146 commits)
powerpc/iommu: Fix multiple issues with IOMMU pools code
powerpc: Fix VMX fix for memcpy case
driver/mtd:IFC NAND:Initialise internal SRAM before any write
powerpc/fsl-pci: use 'Header Type' to identify PCIE mode
powerpc/eeh: Don't release eeh_mutex in eeh_phb_pe_get
powerpc: Remove tlb batching hack for nighthawk
powerpc: Set paca->data_offset = 0 for boot cpu
powerpc/perf: Sample only if SIAR-Valid bit is set in P7+
powerpc/fsl-pci: fix warning when CONFIG_SWIOTLB is disabled
powerpc/mpc85xx: Update interrupt handling for IFC controller
powerpc/85xx: Enable USB support in p1023rds_defconfig
powerpc/smp: Do not disable IPI interrupts during suspend
powerpc/eeh: Fix crash on converting OF node to edev
powerpc/eeh: Lock module while handling EEH event
powerpc/kprobe: Don't emulate store when kprobe stwu r1
powerpc/kprobe: Complete kprobe and migrate exception frame
powerpc/kprobe: Introduce a new thread flag
powerpc: Remove unused __get_user64() and __put_user64()
powerpc/eeh: Global mutex to protect PE tree
powerpc/eeh: Remove EEH PE for normal PCI hotplug
...
Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.
The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.
The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.
Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.
Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.
While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.
Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.
After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."
Fix up some fairly trivial conflicts in netfilter uid/git logging code.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...
In hpte_init_native() we call tlb_batching_enabled() to decide if we
should setup ppc_md.flush_hash_range.
tlb_batching_enabled() checks the _unflattened_ device tree, to see
if we are running on a nighthawk.
Since commit a223535 ("dont allow pSeries_probe to succeed without
initialising MMU", Dec 2006), hpte_init_native() has been called from
pSeries_probe() - at which point we have not yet unflattened the
device tree.
This means tlb_batching_enabled() will always return true, so the hack
has effectively been disabled since Dec 2006. Ergo, I think we can
drop it.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This update the proto-VSID and VSID scramble related information
to be more generic by using names instead of current values.
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Increase max addressable range to 64TB. This is not tested on
real hardware yet.
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With larger vsid we need to track more bits of ESID in slb cache
for slb invalidate.
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
ASM_VSID_SCRAMBLE can leave non-zero bits in the high 28 bits of the result
for 256MB segment (40 bits for 1T segment). Properly mask them before using
the values in slbmte
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch makes the high psizes mask as an unsigned char array
so that we can have more than 16TB. Currently we support upto
64TB
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch convert different functions to take virtual page number
instead of virtual address. Virtual page number is virtual address
shifted right by VPN_SHIFT (12) bits. This enable us to have an
address range of upto 76 bits.
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch simplify hpte_decode for easy switching of virtual address to
virtual page number in the later patch
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To clarify the meaning for future readers, replace the open coded
19 with CONTEXT_BITS
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Remove the dependency on PCI initialization for SWIOTLB initialization.
So that PCI can be initialized at proper time.
SWIOTLB is partly determined by PCI inbound/outbound map which is assigned
in PCI initialization. But swiotlb_init() should be done at the stage of
mem_init() which is much earlier than PCI initialization. So we reserve the
memory for SWIOTLB first and free it if not necessary.
All boards are converted to fit this change.
Signed-off-by: Jia Hongtao <B38951@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
Acked-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
sys_subpage_prot() takes an unsigned long for 'addr' then does some stuff
with it and the result is stored in a signed int, i, which is eventually
used as the size parameter in a copy_from_user call. Update 'i' to be an
unsigned long as well and since 'nw' is used in a size_t context which,
depending on whether this is 32- or 64-bit may be unsigned int or unsigned
long, switch that to a size_t and always be right.
Finally, since we're in the neighbourhood, make the same changes to
subpage_prot_clear().
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Joe MacDonald <joe.macdonald@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There are some device-tree nodes, whose values are of type phys_addr_t.
The phys_addr_t is variable sized based on the CONFIG_PHSY_T_64BIT.
Change these to a fixed unsigned long long for consistency.
This patch does the change only for memory_limit.
The following is a list of such variables which need the change:
1) kernel_end, crashk_size - in arch/powerpc/kernel/machine_kexec.c
2) (struct resource *)crashk_res.start - We could export a local static
variable from machine_kexec.c.
Changing the above values might break the kexec-tools. So, I will
fix kexec-tools first to handle the different sized values and then change
the above.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch_update_cpu_topology() should only return 1 when the topology has
actually changed, and should return 0 otherwise.
This patch fixes a potential bug where rebuild_sched_domains() would
reinitialize the sched domains even when the topology hasn't changed.
Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Embedded.Hypervisor category defines GSPRG0..3 physical registers for guests.
Avoid SPRG4-7 usage as scratch in host exception handlers, otherwise guest
SPRG4-7 registers will be clobbered.
For bolted TLB miss exception handlers, which is the version currently
supported by KVM, use SPRN_SPRG_GEN_SCRATCH aka SPRG0 instead of
SPRN_SPRG_TLB_SCRATCH aka SPRG6. Keep using TLB PACA slots to fit in one
64-byte cache line.
For critical exception handlers use SPRG3 instead of SPRG7. Provide a routine
to store and restore user-visible SPRGs. This will be subsequently used
to restore VDSO information in SPRG3. Add EX_R13 to paca slots to free up
SPRG3 and change the critical exception epilog to use it.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Hook DO_KVM macro into 64-bit booke for KVM integration. Extend interrupt
handlers' parameter list with interrupt vector numbers to accomodate the macro.
Only the bolted version of tlb miss handers is addressed now.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add thread_struct.trap_nr and use it to store the last exception
the thread experienced. In this patch, we populate the field at
various places where we force_sig_info() to the process.
This is also used in uprobes to determine if the probed instruction
caused an exception.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It's empty now, apart from other includes.
Fixup a few files that were getting things via this header.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These days they are just __va() and __pa() respectively.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
abs_to_virt() is just a wrapper around __va(), call __va() directly.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we map a page that wasn't icache cleared before, do so when first
mapping it in KVM using the same information bits as the Linux mapping
logic. That way we are 100% sure that any page we map does not have stale
entries in the icache.
Signed-off-by: Alexander Graf <agraf@suse.de>
Some macros use RA where when RA=R0 the values is 0, so make this
the enforced mnemonic in the macro.
Idea suggested by Andreas Schwab.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These macros are using integers where they could be using logical
names since they take registers.
We are going to enforce this soon, so fix these up now.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Merge the defines of STACKFRAMESIZE, STK_REG, STK_PARAM from different
places.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anything that uses a constructed instruction (ie. from ppc-opcode.h),
need to use the new R0 macro, as %r0 is not going to work.
Also convert usages of macros where we are just determining an offset
(usually for a load/store), like:
std r14,STK_REG(r14)(r1)
Can't use STK_REG(r14) as %r14 doesn't work in the STK_REG macro since
it's just calculating an offset.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Newer gcc are being a bit blind here (it's pretty obvious we don't
reach the code path using the array if we haven't initialized the
pointer) but none of that is performance critical so let's just
silence it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The form affinity for NUMA is set to 1 if the firmware supports
OPAL. Otherwise, we have to retrieve that from OF node "/chosen".
For the latter case, OF node "/chosen" reference count was never
decreased.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
chroma_defconfig currently gives me this with gcc 4.6:
arch/powerpc/mm/numa.c:638:13: error: 'dm' may be used uninitialized in this function [-Werror=uninitialized]
It's a bogus warning/error since of_get_drconf_memory() only writes it
anyway.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: <stable@kernel.org> [v3.3+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Current CPU hotplug code has some task->mm handling issues:
1. Working with task->mm w/o getting mm or grabing the task lock is
dangerous as ->mm might disappear (exit_mm() assigns NULL under
task_lock(), so tasklist lock is not enough).
We can't use get_task_mm()/mmput() pair as mmput() might sleep,
so we must take the task lock while handle its mm.
2. Checking for process->mm is not enough because process' main
thread may exit or detach its mm via use_mm(), but other threads
may still have a valid mm.
To fix this we would need to use find_lock_task_mm(), which would
walk up all threads and returns an appropriate task (with task
lock held).
clear_tasks_mm_cpumask() has all the issues fixed, so let's use it.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9fb48c744b
"params: add 3rd arg to option handler callback signature"
added an extra arg to the function, but didn't catch all the use
cases needing it, causing this compile fail in mpc85xx_defconfig:
arch/powerpc/mm/hugetlbpage.c:316:4: error: passing argument 7 of
'parse_args' from incompatible pointer type [-Werror]
include/linux/moduleparam.h:317:12: note: expected
'int (*)(char *, char *, const char *)' but argument is of type
'int (*)(char *, char *)'
This function has no need to printk out the "doing" value, so
just add the arg as an "unused".
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Becky Bruce <beckyb@kernel.crashing.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
/5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
NypQthI85pc=
=G9mT
-----END PGP SIGNATURE-----
Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system
Pull "Disintegrate and delete asm/system.h" from David Howells:
"Here are a bunch of patches to disintegrate asm/system.h into a set of
separate bits to relieve the problem of circular inclusion
dependencies.
I've built all the working defconfigs from all the arches that I can
and made sure that they don't break.
The reason for these patches is that I recently encountered a circular
dependency problem that came about when I produced some patches to
optimise get_order() by rewriting it to use ilog2().
This uses bitops - and on the SH arch asm/bitops.h drags in
asm-generic/get_order.h by a circuituous route involving asm/system.h.
The main difficulty seems to be asm/system.h. It holds a number of
low level bits with no/few dependencies that are commonly used (eg.
memory barriers) and a number of bits with more dependencies that
aren't used in many places (eg. switch_to()).
These patches break asm/system.h up into the following core pieces:
(1) asm/barrier.h
Move memory barriers here. This already done for MIPS and Alpha.
(2) asm/switch_to.h
Move switch_to() and related stuff here.
(3) asm/exec.h
Move arch_align_stack() here. Other process execution related bits
could perhaps go here from asm/processor.h.
(4) asm/cmpxchg.h
Move xchg() and cmpxchg() here as they're full word atomic ops and
frequently used by atomic_xchg() and atomic_cmpxchg().
(5) asm/bug.h
Move die() and related bits.
(6) asm/auxvec.h
Move AT_VECTOR_SIZE_ARCH here.
Other arch headers are created as needed on a per-arch basis."
Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that. We'll find out anything that got broken and fix it..
* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
Delete all instances of asm/system.h
Remove all #inclusions of asm/system.h
Add #includes needed to permit the removal of asm/system.h
Move all declarations of free_initmem() to linux/mm.h
Disintegrate asm/system.h for OpenRISC
Split arch_align_stack() out from asm-generic/system.h
Split the switch_to() wrapper out of asm-generic/system.h
Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
Create asm-generic/barrier.h
Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
Disintegrate asm/system.h for Xtensa
Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
Disintegrate asm/system.h for Tile
Disintegrate asm/system.h for Sparc
Disintegrate asm/system.h for SH
Disintegrate asm/system.h for Score
Disintegrate asm/system.h for S390
Disintegrate asm/system.h for PowerPC
Disintegrate asm/system.h for PA-RISC
Disintegrate asm/system.h for MN10300
...
Pull kvm updates from Avi Kivity:
"Changes include timekeeping improvements, support for assigning host
PCI devices that share interrupt lines, s390 user-controlled guests, a
large ppc update, and random fixes."
This is with the sign-off's fixed, hopefully next merge window we won't
have rebased commits.
* 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits)
KVM: Convert intx_mask_lock to spin lock
KVM: x86: fix kvm_write_tsc() TSC matching thinko
x86: kvmclock: abstract save/restore sched_clock_state
KVM: nVMX: Fix erroneous exception bitmap check
KVM: Ignore the writes to MSR_K7_HWCR(3)
KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask
KVM: PMU: add proper support for fixed counter 2
KVM: PMU: Fix raw event check
KVM: PMU: warn when pin control is set in eventsel msr
KVM: VMX: Fix delayed load of shared MSRs
KVM: use correct tlbs dirty type in cmpxchg
KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
KVM: Ensure all vcpus are consistent with in-kernel irqchip settings
KVM: x86 emulator: Allow PM/VM86 switch during task switch
KVM: SVM: Fix CPL updates
KVM: x86 emulator: VM86 segments must have DPL 3
KVM: x86 emulator: Fix task switch privilege checks
arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice
KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation
KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
...
Disintegrate asm/system.h for PowerPC.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
cc: linuxppc-dev@lists.ozlabs.org
This patch adds a set of macros that can be used to declare
kernel parameters to be parsed _before_ initcalls at a chosen
level are executed. We rename the now-unused "flags" field of
struct kernel_param as the level. It's signed, for when we
use this for early params as well, in future.
Linker macro collating init calls had to be modified in order
to add additional symbols between levels that are later used
by the init code to split the calls into blocks.
Signed-off-by: Pawel Moll <pawel.moll@arm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull powerpc merge from Benjamin Herrenschmidt:
"Here's the powerpc batch for this merge window. It is going to be a
bit more nasty than usual as in touching things outside of
arch/powerpc mostly due to the big iSeriesectomy :-) We finally got
rid of the bugger (legacy iSeries support) which was a PITA to
maintain and that nobody really used anymore.
Here are some of the highlights:
- Legacy iSeries is gone. Thanks Stephen ! There's still some bits
and pieces remaining if you do a grep -ir series arch/powerpc but
they are harmless and will be removed in the next few weeks
hopefully.
- The 'fadump' functionality (Firmware Assisted Dump) replaces the
previous (equivalent) "pHyp assisted dump"... it's a rewrite of a
mechanism to get the hypervisor to do crash dumps on pSeries, the
new implementation hopefully being much more reliable. Thanks
Mahesh Salgaonkar.
- The "EEH" code (pSeries PCI error handling & recovery) got a big
spring cleaning, motivated by the need to be able to implement a
new backend for it on top of some new different type of firwmare.
The work isn't complete yet, but a good chunk of the cleanups is
there. Note that this adds a field to struct device_node which is
not very nice and which Grant objects to. I will have a patch soon
that moves that to a powerpc private data structure (hopefully
before rc1) and we'll improve things further later on (hopefully
getting rid of the need for that pointer completely). Thanks Gavin
Shan.
- I dug into our exception & interrupt handling code to improve the
way we do lazy interrupt handling (and make it work properly with
"edge" triggered interrupt sources), and while at it found & fixed
a wagon of issues in those areas, including adding support for page
fault retry & fatal signals on page faults.
- Your usual random batch of small fixes & updates, including a bunch
of new embedded boards, both Freescale and APM based ones, etc..."
I fixed up some conflicts with the generalized irq-domain changes from
Grant Likely, hopefully correctly.
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits)
powerpc/ps3: Do not adjust the wrapper load address
powerpc: Remove the rest of the legacy iSeries include files
powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces
init: Remove CONFIG_PPC_ISERIES
powerpc: Remove FW_FEATURE ISERIES from arch code
tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable
powerpc/spufs: Fix double unlocks
powerpc/5200: convert mpc5200 to use of_platform_populate()
powerpc/mpc5200: add options to mpc5200_defconfig
powerpc/mpc52xx: add a4m072 board support
powerpc/mpc5200: update mpc5200_defconfig to fit for charon board
Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup
powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board
powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board
MAINTAINERS: Update PowerPC 4xx tree
powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board
powerpc: document the FSL MPIC message register binding
powerpc: add support for MPIC message register API
powerpc/fsl: Added aliased MSIIR register address to MSI node in dts
powerpc/85xx: mpc8548cds - add 36-bit dts
...
This is no longer selectable, so just remove all the dependent code.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The registers that describe size supported by TLB are different on MMU
v2 as well as we support power of two page sizes. For now we continue
to assume that FSL variable size array supports all page sizes up to the
maximum one reported in TLB1PS.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Other architectures such as x86 and ARM have been growing
new support for features like retrying page faults after
dropping the mm semaphore to break contention, or being
able to return from a stuck page fault when a SIGKILL is
pending.
This refactors our implementation of do_page_fault() to
move the error handling out of line in a way similar to
x86 and adds support for those two features.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We currently turn interrupts back to their previous state before
calling do_page_fault(). This can be annoying when debugging as
a bad fault will potentially have lost some processor state before
getting into the debugger.
We also end up calling some generic code with interrupts enabled
such as notify_page_fault() with interrupts enabled, which could
be unexpected.
This changes our code to behave more like other architectures,
and make the assembly entry code call into do_page_faults() with
interrupts disabled. They are conditionally re-enabled from
within do_page_fault() in the same spot x86 does it.
While there, add the might_sleep() test in the case of a successful
trylock of the mmap semaphore, again like x86.
Also fix a bug in the existing assembly where r12 (_MSR) could get
clobbered by C calls (the DTL accounting in the exception common
macro and DISABLE_INTS) in some cases.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2. Add the r12 clobber fix
This removes the various bits of assembly in the kernel entry,
exception handling and SLB management code that were specific
to running under the legacy iSeries hypervisor which is no
longer supported.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Emit the function name not the address when possible.
builtin_return_address() gives an address. When building
a kernel with CONFIG_KALLSYMS, emit the actual function
name not the address.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There is a race where a thread causes a coprocessor type to be valid
in its own ACOP _and_ in the current context, but it does not
propagate to the ACOP register of other threads in time for them to
use it. The original code tries to solve this by sending an IPI to
all threads on the system, which is heavy handed, but unfortunately
still provides a window where the icswx is issued by other threads and
the ACOP is not up to date.
This patch detects that the ACOP DSI fault was a "false positive" and
syncs the ACOP and causes the icswx to be replayed.
Signed-off-by: Jimi Xenidis <jimix@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds the infrastructure to enable us to page out pages underneath
a Book3S HV guest, on processors that support virtualized partition
memory, that is, POWER7. Instead of pinning all the guest's pages,
we now look in the host userspace Linux page tables to find the
mapping for a given guest page. Then, if the userspace Linux PTE
gets invalidated, kvm_unmap_hva() gets called for that address, and
we replace all the guest HPTEs that refer to that page with absent
HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
set, which will cause an HDSI when the guest tries to access them.
Finally, the page fault handler is extended to reinstantiate the
guest HPTE when the guest tries to access a page which has been paged
out.
Since we can't intercept the guest DSI and ISI interrupts on PPC970,
we still have to pin all the guest pages on PPC970. We have a new flag,
kvm->arch.using_mmu_notifiers, that indicates whether we can page
guest pages out. If it is not set, the MMU notifier callbacks do
nothing and everything operates as before.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
On 2012-02-20 11:02:51 Mon, Paul Mackerras wrote:
> On Thu, Feb 16, 2012 at 04:44:30PM +0530, Mahesh J Salgaonkar wrote:
>
> If I have read the code correctly, we are going to get this printk on
> non-pSeries machines or on older pSeries machines, even if the user
> has not put the fadump=on option on the kernel command line. The
> printk will be annoying since there is no actual error condition. It
> seems to me that the condition for the printk should include
> fw_dump.fadump_enabled. In other words you should probably add
>
> if (!fw_dump.fadump_enabled)
> return 0;
>
> at the beginning of the function.
Hi Paul,
Thanks for pointing it out. Please find the updated patch below.
The existing patches above this (4/10 through 10/10) cleanly applies
on this update.
Thanks,
-Mahesh.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
node_to_cpumask() has been replaced by cpumask_of_node(), and wholly
removed since commit 29c337a0 ("cpumask: remove obsolete node_to_cpumask
now everyone uses cpumask_of_node").
So update the comments for setup_node_to_cpumask_map().
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
Kconfig: acpi: Fix typo in comment.
misc latin1 to utf8 conversions
devres: Fix a typo in devm_kfree comment
btrfs: free-space-cache.c: remove extra semicolon.
fat: Spelling s/obsolate/obsolete/g
SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
tools/power turbostat: update fields in manpage
mac80211: drop spelling fix
types.h: fix comment spelling for 'architectures'
typo fixes: aera -> area, exntension -> extension
devices.txt: Fix typo of 'VMware'.
sis900: Fix enum typo 'sis900_rx_bufer_status'
decompress_bunzip2: remove invalid vi modeline
treewide: Fix comment and string typo 'bufer'
hyper-v: Update MAINTAINERS
treewide: Fix typos in various parts of the kernel, and fix some comments.
clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
leds: Kconfig: Fix typo 'D2NET_V2'
sound: Kconfig: drop unknown symbol ARCH_CLPS7500
...
Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
kconfig additions, close to removed commented-out old ones)
* 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (73 commits)
arm: fix up some samsung merge sysdev conversion problems
firmware: Fix an oops on reading fw_priv->fw in sysfs loading file
Drivers:hv: Fix a bug in vmbus_driver_unregister()
driver core: remove __must_check from device_create_file
debugfs: add missing #ifdef HAS_IOMEM
arm: time.h: remove device.h #include
driver-core: remove sysdev.h usage.
clockevents: remove sysdev.h
arm: convert sysdev_class to a regular subsystem
arm: leds: convert sysdev_class to a regular subsystem
kobject: remove kset_find_obj_hinted()
m86k: gpio - convert sysdev_class to a regular subsystem
mips: txx9_sram - convert sysdev_class to a regular subsystem
mips: 7segled - convert sysdev_class to a regular subsystem
sh: dma - convert sysdev_class to a regular subsystem
sh: intc - convert sysdev_class to a regular subsystem
power: suspend - convert sysdev_class to a regular subsystem
power: qe_ic - convert sysdev_class to a regular subsystem
power: cmm - convert sysdev_class to a regular subsystem
s390: time - convert sysdev_class to a regular subsystem
...
Fix up conflicts with 'struct sysdev' removal from various platform
drivers that got changed:
- arch/arm/mach-exynos/cpu.c
- arch/arm/mach-exynos/irq-eint.c
- arch/arm/mach-s3c64xx/common.c
- arch/arm/mach-s3c64xx/cpu.c
- arch/arm/mach-s5p64x0/cpu.c
- arch/arm/mach-s5pv210/common.c
- arch/arm/plat-samsung/include/plat/cpu.h
- arch/powerpc/kernel/sysfs.c
and fix up cpu_is_hotpluggable() as per Greg in include/linux/cpu.h
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (185 commits)
powerpc: fix compile error with 85xx/p1010rdb.c
powerpc: fix compile error with 85xx/p1023_rds.c
powerpc/fsl: add MSI support for the Freescale hypervisor
arch/powerpc/sysdev/fsl_rmu.c: introduce missing kfree
powerpc/fsl: Add support for Integrated Flash Controller
powerpc/fsl: update compatiable on fsl 16550 uart nodes
powerpc/85xx: fix PCI and localbus properties in p1022ds.dts
powerpc/85xx: re-enable ePAPR byte channel driver in corenet32_smp_defconfig
powerpc/fsl: Update defconfigs to enable some standard FSL HW features
powerpc: Add TBI PHY node to first MDIO bus
sbc834x: put full compat string in board match check
powerpc/fsl-pci: Allow 64-bit PCIe devices to DMA to any memory address
powerpc: Fix unpaired probe_hcall_entry and probe_hcall_exit
offb: Fix setting of the pseudo-palette for >8bpp
offb: Add palette hack for qemu "standard vga" framebuffer
offb: Fix bug in calculating requested vram size
powerpc/boot: Change the WARN to INFO for boot wrapper overlap message
powerpc/44x: Fix build error on currituck platform
powerpc/boot: Change the load address for the wrapper to fit the kernel
powerpc/44x: Enable CRASH_DUMP for 440x
...
Fix up a trivial conflict in arch/powerpc/include/asm/cputime.h due to
the additional sparse-checking code for cputime_t.
This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file,
and it fixes the build error in the arch/x86/kernel/microcode_core.c
file, that the merge did not catch.
The microcode_core.c patch was provided by Stephen Rothwell
<sfr@canb.auug.org.au> who was invaluable in the merge issues involved
with the large sysdev removal process in the driver-core tree.
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
and converts the devices to regular devices. The sysdev drivers are
implemented as subsystem interfaces now.
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.
Userspace relies on events and generic sysfs subsystem infrastructure
from sysdev devices, which are made available with this conversion.
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Len Brown <lenb@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We find the runtime address of _stext and relocate ourselves based
on the following calculation.
virtual_base = ALIGN(KERNELBASE,KERNEL_TLB_PIN_SIZE) +
MODULO(_stext.run,KERNEL_TLB_PIN_SIZE)
relocate() is called with the Effective Virtual Base Address (as
shown below)
| Phys. Addr| Virt. Addr |
Page |------------------------|
Boundary | | |
| | |
| | |
Kernel Load |___________|_ __ _ _ _ _|<- Effective
Addr(_stext)| | ^ |Virt. Base Addr
| | | |
| | | |
| |reloc_offset|
| | | |
| | | |
| |______v_____|<-(KERNELBASE)%TLB_SIZE
| | |
| | |
| | |
Page |-----------|------------|
Boundary | | |
On BookE, we need __va() & __pa() early in the boot process to access
the device tree.
Currently this has been defined as :
#define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) -
PHYSICAL_START + KERNELBASE)
where:
PHYSICAL_START is kernstart_addr - a variable updated at runtime.
KERNELBASE is the compile time Virtual base address of kernel.
This won't work for us, as kernstart_addr is dynamic and will yield different
results for __va()/__pa() for same mapping.
e.g.,
Let the kernel be loaded at 64MB and KERNELBASE be 0xc0000000 (same as
PAGE_OFFSET).
In this case, we would be mapping 0 to 0xc0000000, and kernstart_addr = 64M
Now __va(1MB) = (0x100000) - (0x4000000) + 0xc0000000
= 0xbc100000 , which is wrong.
it should be : 0xc0000000 + 0x100000 = 0xc0100000
On platforms which support AMP, like PPC_47x (based on 44x), the kernel
could be loaded at highmem. Hence we cannot always depend on the compile
time constants for mapping.
Here are the possible solutions:
1) Update kernstart_addr(PHSYICAL_START) to match the Physical address of
compile time KERNELBASE value, instead of the actual Physical_Address(_stext).
The disadvantage is that we may break other users of PHYSICAL_START. They
could be replaced with __pa(_stext).
2) Redefine __va() & __pa() with relocation offset
#ifdef CONFIG_RELOCATABLE_PPC32
#define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) - PHYSICAL_START + (KERNELBASE + RELOC_OFFSET)))
#define __pa(x) ((unsigned long)(x) + PHYSICAL_START - (KERNELBASE + RELOC_OFFSET))
#endif
where, RELOC_OFFSET could be
a) A variable, say relocation_offset (like kernstart_addr), updated
at boot time. This impacts performance, as we have to load an additional
variable from memory.
OR
b) #define RELOC_OFFSET ((PHYSICAL_START & PPC_PIN_SIZE_OFFSET_MASK) - \
(KERNELBASE & PPC_PIN_SIZE_OFFSET_MASK))
This introduces more calculations for doing the translation.
3) Redefine __va() & __pa() with a new variable
i.e,
#define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) + VIRT_PHYS_OFFSET))
where VIRT_PHYS_OFFSET :
#ifdef CONFIG_RELOCATABLE_PPC32
#define VIRT_PHYS_OFFSET virt_phys_offset
#else
#define VIRT_PHYS_OFFSET (KERNELBASE - PHYSICAL_START)
#endif /* CONFIG_RELOCATABLE_PPC32 */
where virt_phy_offset is updated at runtime to :
Effective KERNELBASE - kernstart_addr.
Taking our example, above:
virt_phys_offset = effective_kernelstart_vaddr - kernstart_addr
= 0xc0400000 - 0x400000
= 0xc0000000
and
__va(0x100000) = 0xc0000000 + 0x100000 = 0xc0100000
which is what we want.
I have implemented (3) in the following patch which has same cost of
operation as the existing one.
I have tested the patches on 440x platforms only. However this should
work fine for PPC_47x also, as we only depend on the runtime address
and the current TLB XLAT entry for the startup code, which is available
in r25. I don't have access to a 47x board yet. So, it would be great if
somebody could test this on 47x.
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Signed-off-by: Josh Boyer <jwboyer@gmail.com>
The current implementation of CONFIG_RELOCATABLE in BookE is based
on mapping the page aligned kernel load address to KERNELBASE. This
approach however is not enough for platforms, where the TLB page size
is large (e.g, 256M on 44x). So we are renaming the RELOCATABLE used
currently in BookE to DYNAMIC_MEMSTART to reflect the actual method.
The CONFIG_RELOCATABLE for PPC32(BookE) based on processing of the
dynamic relocations will be introduced in the later in the patch series.
This change would allow the use of the old method of RELOCATABLE for
platforms which can afford to enforce the page alignment (platforms with
smaller TLB size).
Changes since v3:
* Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is
either a RELOCATABLE or DYNAMIC_MEMSTART(Suggested by: Josh Boyer)
Suggested-by: Scott Wood <scottwood@freescale.com>
Tested-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linux ppc dev <linuxppc-dev@lists.ozlabs.org>
Signed-off-by: Josh Boyer <jwboyer@gmail.com>
read_n_cells() cannot be marked as .devinit.text since it is referenced
from two functions that are not in that section: of_get_lmb_size() and
hot_add_drconf_scn_to_nid().
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
mark_reserved_regions_for_nid() is only called from do_init_bootmem(),
which is in .init.text, so it must be in the same section to avoid a
section mismatch warning.
Reported-by: Subrata Modak <subrata@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CONFIG_PPC47x doesn't exist in Kconfig and no 476 processor calls this
function ppc44x_pin_tlb() as it has it's own ppc47x_pin_tlb().
This code is probably an artifact of the original 476 code that
shouldn't have made it upstream.
Signed-off-by: Christoph Egger <siccegge@cs.fau.de>
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Josh Boyer <jwboyer@gmail.com>
powerpc doesn't access early_node_map[] directly and enabling
HAVE_MEMBLOCK_NODE_MAP is trivial - replacing add_active_range() calls
with memblock_set_node() and selecting HAVE_MEMBLOCK_NODE_MAP is
enough.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
The only function of memblock_analyze() is now allowing resize of
memblock region arrays. Rename it to memblock_allow_resize() and
update its users.
* The following users remain the same other than renaming.
arm/mm/init.c::arm_memblock_init()
microblaze/kernel/prom.c::early_init_devtree()
powerpc/kernel/prom.c::early_init_devtree()
openrisc/kernel/prom.c::early_init_devtree()
sh/mm/init.c::paging_init()
sparc/mm/init_64.c::paging_init()
unicore32/mm/init.c::uc32_memblock_init()
* In the following users, analyze was used to update total size which
is no longer necessary.
powerpc/kernel/machine_kexec.c::reserve_crashkernel()
powerpc/kernel/prom.c::early_init_devtree()
powerpc/mm/init_32.c::MMU_init()
powerpc/mm/tlb_nohash.c::__early_init_mmu()
powerpc/platforms/ps3/mm.c::ps3_mm_add_memory()
powerpc/platforms/embedded6xx/wii.c::wii_memory_fixups()
sh/kernel/machine_kexec.c::reserve_crashkernel()
* x86/kernel/e820.c::memblock_x86_fill() was directly setting
memblock_can_resize before populating memblock and calling analyze
afterwards. Call memblock_allow_resize() before start populating.
memblock_can_resize is now static inside memblock.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: "H. Peter Anvin" <hpa@zytor.com>
* early_init_devtree(): Total memory size is aligned to PAGE_SIZE;
however, alignment isn't enforced if memory_limit is explicitly
specified. Simplify the logic and always apply PAGE_SIZE alignment.
* MMU_init(): memblock regions is truncated by directly modifying
memblock.memory.cnt. This is incomplete (reserved array is not
truncated) and unnecessarily low level hindering further memblock
improvments. Use memblock_enforce_memory_limit() instead.
* wii_memory_fixups(): Unnecessarily low level direct manipulation of
memblock regions. The same result can be achieved using properly
abstracted operations. Reimplement using memblock API.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
With CONFIG_STRICT_DEVMEM=y, user space cannot read any part of /dev/mem.
Since this breaks librtas, punch a hole in /dev/mem to allow access to the
rmo_buffer that librtas needs.
Anton Blanchard reported the problem and helped with the fix.
A quick test for this patch:
# cat /proc/rtas/rmo_buffer
000000000f190000 10000
# python -c "print 0x000000000f190000 / 0x10000"
3865
# dd if=/dev/mem of=/tmp/foo count=1 bs=64k skip=3865
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.000205235 s, 319 MB/s
# dd if=/dev/mem of=/tmp/foo
dd: reading `/dev/mem': Operation not permitted
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00022519 s, 0.0 kB/s
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This avoids an extra find_vma() and is less error-prone.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For 64-bit FSL_BOOKE implementations, gigantic pages need to be
reserved at boot time by the memblock code based on the command line.
This adds the call that handles the reservation, and fixes some code
comments.
It also removes the previous pr_err when reserve_hugetlb_gpages
is called on a system without hugetlb enabled - the way the code is
structured, the call is unconditional and the resulting error message
spurious and confusing.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Before hugetlb, at each level of the table, we test for
!0 to determine if we have a valid table entry. With hugetlb, this
compare becomes:
< 0 is a normal entry
0 is an invalid entry
> 0 is huge
This works because the hugepage code pulls the top bit off the entry
(which for non-huge entries always has the top bit set) as an
indicator that we have a hugepage.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I happened to comment this code while I was digging through it;
we might as well commit that. I also made some whitespace
changes - the existing code had a lot of unnecessary newlines
that I found annoying when I was working on my tiny laptop.
No functional changes.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The original 32-bit hugetlb implementation used PPC64 vs PPC32 to
determine which code path to take. However, the final hugetlb
implementation for 64-bit FSL ended up shared with the FSL
32-bit code so the actual check needs to be FSL_BOOK3E vs
everything else. This patch changes the include protections to
reflect this.
There are also a couple of related comment fixes.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This updates the hugetlb page table code to handle 64-bit FSL_BOOKE.
The previous 32-bit work counted on the inner levels of the page table
collapsing.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch does 2 things: It corrects the code that determines the
size to write into MAS1 for the PPC_MM_SLICES case (this originally
came from David Gibson and I had incorrectly altered it), and it
changes the methodolody used to calculate the size for !PPC_MM_SLICES
to work for 64-bit as well as 32-bit.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we don't have slices, we should be able to use the generic
hugetlb_get_unmapped_area() code
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The below patch fixes some typos in various parts of the kernel, as well as fixes some comments.
Please let me know if I missed anything, and I will try to get it changed and resent.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Conflicts & resolutions:
* arch/x86/xen/setup.c
dc91c728fd "xen: allow extra memory to be in multiple regions"
24aa07882b "memblock, x86: Replace memblock_x86_reserve/free..."
conflicted on xen_add_extra_mem() updates. The resolution is
trivial as the latter just want to replace
memblock_x86_reserve_range() with memblock_reserve().
* drivers/pci/intel-iommu.c
166e9278a3 "x86/ia64: intel-iommu: move to drivers/iommu/"
5dfe8660a3 "bootmem: Replace work_with_active_regions() with..."
conflicted as the former moved the file under drivers/iommu/.
Resolved by applying the chnages from the latter on the moved
file.
* mm/Kconfig
6661672053 "memblock: add NO_BOOTMEM config symbol"
c378ddd53f "memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option"
conflicted trivially. Both added config options. Just
letting both add their own options resolves the conflict.
* mm/memblock.c
d1f0ece6cd "mm/memblock.c: small function definition fixes"
ed7b56a799 "memblock: Remove memblock_memory_can_coalesce()"
confliected. The former updates function removed by the
latter. Resolution is trivial.
Signed-off-by: Tejun Heo <tj@kernel.org>
Since commit 8a0a9bd4db, this comment in mmap_rnd() does not
hold true as the value returned by get_random_int() will in fact be
different every single call. Remove the comment and simplify the code
back to its original desired form.
This reverts commit a5adc91a4b which is no longer necessary and
also fixes the sparc code that copied this same adjustment.
Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As described in the help text in the patch, this token restricts general
access to /dev/mem as a way of increasing the security. Specifically, access
to exclusive IOMEM and kernel RAM is denied unless CONFIG_STRICT_DEVMEM is
set to 'n'.
Implement the 'devmem_is_allowed()' interface for Powerpc. It will be
called from range_is_allowed() when userpsace attempts to access /dev/mem.
This patch is based on an earlier patch from Steve Best and with input from
Paul Mackerras and Scott Wood.
[BenH] Fixed a typo or two and removed the generic change which should
be submitted as a separate patch
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds a fault handler that responds to illegal Coprocessor
types. Currently all CTs are treated and illegal. There are two ways
to report the fault back to the application. If the application used
the record form ("icswx.") then the architected "reject" is emulated.
If the application did not used the record form ("icswx") then it is
selectable by config whether the failure is silent (as architected) or
a SIGILL is generated.
In all cases pr_warn() is used to log the bad CT.
Signed-off-by: Jimi Xenidis <jimix@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some processors, like embedded, that already have a PID register that
is managed by the system. This patch separates the ACOP and PID
processing into separate files so that the ACOP code can be shared.
Signed-off-by: Jimi Xenidis <jimix@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
arch/powerpc/mm/hugetlbpage.c: In function 'reserve_hugetlb_gpages':
arch/powerpc/mm/hugetlbpage.c:312:2: error: implicit declaration of function 'parse_args'
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch adds support for numa topology on powernv platforms running
OPAL formware. It checks for the type of platform at run time and
sets the affinity form correctly so that NUMA topology can be discovered
correctly.
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We've resisted adding System RAM to /proc/iomem because it is
the wrong place for it. Unfortunately we continue to find tools
that rely on this behaviour so give up and add it in.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
powerpc/p3060qds: Add support for P3060QDS board
powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX
powerpc/85xx: Make kexec to interate over online cpus
powerpc/fsl_booke: Fix comment in head_fsl_booke.S
powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices
powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver
powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver
powerpc/86xx: Correct Gianfar support for GE boards
powerpc/cpm: Clear muram before it is in use.
drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager
powerpc/fsl_msi: add support for "msi-address-64" property
powerpc/85xx: Setup secondary cores PIR with hard SMP id
powerpc/fsl-booke: Fix settlbcam for 64-bit
powerpc/85xx: Adding DCSR node to dtsi device trees
powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards
powerpc/85xx: fix PHYS_64BIT selection for P1022DS
powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map
powerpc: respect mem= setting for early memory limit setup
powerpc: Update corenet64_smp_defconfig
powerpc: Update mpc85xx/corenet 32-bit defconfigs
...
Fix up trivial conflicts in:
- arch/powerpc/configs/40x/hcu4_defconfig
removed stale file, edited elsewhere
- arch/powerpc/include/asm/udbg.h, arch/powerpc/kernel/udbg.c:
added opal and gelic drivers vs added ePAPR driver
- drivers/tty/serial/8250.c
moved UPIO_TSI to powerpc vs removed UPIO_DWAPB support
This avoids duplicating the function in every arch gup_fast.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
powerpc didn't return 0 in that case, if it's rolling back the *nr pointer
it should also return zero to avoid adding pages to the array at the wrong
offset.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Up to this point the code assumed old refcounting for hugepages (pre-thp).
This updates the code directly to the thp mapcount tail page refcounting.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We only taken "refs" pins on the head page not "*nr" pins.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"page" may have changed to point to the next hugepage after the loop
completed, The references have been taken on the head page, so the
put_page must happen there too.
This is a longstanding issue pre-thp inclusion.
It's totally unclear how these page_cache_add_speculative and
pte_val(pte) != pte_val(*ptep) checks are necessary across all the
powerpc gup_fast code, when x86 doesn't need any of that: there's no way
the page can be freed with irq disabled so we're guaranteed the
atomic_inc will happen on a page with page_count > 0 (so not needing the
speculative check).
The pte check is also meaningless on x86: no need to rollback on x86 if
the pte changed, because the pte can still change a CPU tick after the
check succeeded and it won't be rolled back in that case. The important
thing is we got a reference on a valid page that was mapped there a CPU
tick ago. So not knowing the soft tlb refill code of ppc64 in great
detail I'm not removing the "speculative" page_count increase and the
pte checks across all the code, but unless there's a strong reason for
it they should be later cleaned up too.
If a pte can change from huge to non-huge (like it could happen with
THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat
the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs
only so I'm not altering that.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This part of gup_fast doesn't seem capable of handling hugetlbfs ptes,
those should be handled by gup_hugepd only, so these checks are
superfluous.
Plus if this wasn't a noop, it would have oopsed because, the insistence
of using the speculative refcounting would trigger a VM_BUG_ON if a tail
page was encountered in the page_cache_get_speculative().
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().
He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().
So the best way to fix the problem is to keep page_tail->_count zero at
all times. This will guarantee that get_page_unless_zero() can never
succeed on any tail page. page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).
While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages. That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic. As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns. It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()). get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).
The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages. The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it. A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead. So it's worth it. Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages(). The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.
This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All these files were including module.h just for the basic
EXPORT_SYMBOL infrastructure. We can shift them off to the
export.h header which is a way smaller footprint and thus
realize some compile time gains.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Fix failures in powerpc associated with the previously allowed
implicit module.h presence that now lead to things like this:
arch/powerpc/mm/mmu_context_hash32.c:76:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
arch/powerpc/mm/tlb_hash32.c:48:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
arch/powerpc/kernel/pci_32.c:51:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
arch/powerpc/kernel/iomap.c:36:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
arch/powerpc/platforms/44x/canyonlands.c:126:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
arch/powerpc/kvm/44x.c:168:59: error: 'THIS_MODULE' undeclared (first use in this function)
[with several contibutions from Stephen Rothwell <sfr@canb.auug.org.au>]
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
With module.h being implicitly everywhere via device.h, the absence
of explicitly including something for EXPORT_SYMBOL went unnoticed.
Since we are heading to fix things up and clean module.h from the
device.h file, we need to explicitly include these files now.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Currently, it does a cntlzd on the size and then subtracts it from
21.... this doesn't take into account the varying size of a "long".
Just use __ilog instead (and subtract the 10 we have to subtract
to get to the tsize encoding).
Also correct the comment about page sizes supported.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
On FSL Book-E devices we support multiple large TLB sizes and so we can
get into situations in which the initial 1G TLB size is too big and
we're asked for a size that is not mappable by a single entry (like
512M). The single entry is important because when we bring up secondary
cores they need to ensure any data structure they need to access (eg
PACA or stack) is always mapped.
So we really need to determine what size will actually be mapped by the
first TLB entry to ensure we limit early memory references to that
region. We refactor the map_mem_in_cams() code to provider a helper
function that we can utilize to determine the size of the first TLB
entry while taking into account size and alignment constraints.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Commit 41151e77a4 ("powerpc: Hugetlb for BookE") added some
#ifdef CONFIG_MM_SLICES conditionals to hugetlb_get_unmapped_area()
and vma_mmu_pagesize(). Unfortunately this is not the correct config
symbol; it should be CONFIG_PPC_MM_SLICES. The result is that
attempting to use hugetlbfs on 64-bit Power server processors results
in an infinite stack recursion between get_unmapped_area() and
hugetlb_get_unmapped_area().
This fixes it by changing the #ifdef to use CONFIG_PPC_MM_SLICES
in those functions and also in book3e_hugetlb_preload().
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The icswx code introduced an A-B B-A deadlock:
CPU0 CPU1
---- ----
lock(&anon_vma->mutex);
lock(&mm->mmap_sem);
lock(&anon_vma->mutex);
lock(&mm->mmap_sem);
Instead of using the mmap_sem to keep mm_users constant, take the
page table spinlock.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we echo an address the hypervisor doesn't like to
/sys/devices/system/memory/probe we oops the box:
# echo 0x10000000000 > /sys/devices/system/memory/probe
kernel BUG at arch/powerpc/mm/hash_utils_64.c:541!
The backtrace is:
create_section_mapping
arch_add_memory
add_memory
memory_probe_store
sysdev_class_store
sysfs_write_file
vfs_write
SyS_write
In create_section_mapping we BUG if htab_bolt_mapping returned
an error. A better approach is to return an error which will
propagate back to userspace.
Rerunning the test with this patch applied:
# echo 0x10000000000 > /sys/devices/system/memory/probe
-bash: echo: write error: Invalid argument
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While converting code to use for_each_node_by_type I noticed a
number of coding style issues.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use for_each_node_by_type instead of open coding it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
During memory hotplug testing, I got the following warning:
ERROR: Bad of_node_put() on /memory@0
of_node_release
kref_put
of_node_put
of_find_node_by_type
hot_add_node_scn_to_nid
hot_add_scn_to_nid
memory_add_physaddr_to_nid
...
of_find_node_by_type() loop does the of_node_put for us so we only
need the handle the case where we terminate the loop early.
As suggested by Stephen Rothwell we can do the of_node_put
unconditionally outside of the loop since of_node_put handles a
NULL argument fine.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Enable hugepages on Freescale BookE processors. This allows the kernel to
use huge TLB entries to map pages, which can greatly reduce the number of
TLB misses and the amount of TLB thrashing experienced by applications with
large memory footprints. Care should be taken when using this on FSL
processors, as the number of large TLB entries supported by the core is low
(16-64) on current processors.
The supported set of hugepage sizes include 4m, 16m, 64m, 256m, and 1g.
Page sizes larger than the max zone size are called "gigantic" pages and
must be allocated on the command line (and cannot be deallocated).
This is currently only fully implemented for Freescale 32-bit BookE
processors, but there is some infrastructure in the code for
64-bit BooKE.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (99 commits)
drivers/virt: add missing linux/interrupt.h to fsl_hypervisor.c
powerpc/85xx: fix mpic configuration in CAMP mode
powerpc: Copy back TIF flags on return from softirq stack
powerpc/64: Make server perfmon only built on ppc64 server devices
powerpc/pseries: Fix hvc_vio.c build due to recent changes
powerpc: Exporting boot_cpuid_phys
powerpc: Add CFAR to oops output
hvc_console: Add kdb support
powerpc/pseries: Fix hvterm_raw_get_chars to accept < 16 chars, fixing xmon
powerpc/irq: Quieten irq mapping printks
powerpc: Enable lockup and hung task detectors in pseries and ppc64 defeconfigs
powerpc: Add mpt2sas driver to pseries and ppc64 defconfig
powerpc: Disable IRQs off tracer in ppc64 defconfig
powerpc: Sync pseries and ppc64 defconfigs
powerpc/pseries/hvconsole: Fix dropped console output
hvc_console: Improve tty/console put_chars handling
powerpc/kdump: Fix timeout in crash_kexec_wait_realmode
powerpc/mm: Fix output of total_ram.
powerpc/cpufreq: Add cpufreq driver for Momentum Maple boards
powerpc: Correct annotations of pmu registration functions
...
Fix up trivial Kconfig/Makefile conflicts in arch/powerpc, drivers, and
drivers/cpufreq
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (123 commits)
perf: Remove the nmi parameter from the oprofile_perf backend
x86, perf: Make copy_from_user_nmi() a library function
perf: Remove perf_event_attr::type check
x86, perf: P4 PMU - Fix typos in comments and style cleanup
perf tools: Make test use the preset debugfs path
perf tools: Add automated tests for events parsing
perf tools: De-opt the parse_events function
perf script: Fix display of IP address for non-callchain path
perf tools: Fix endian conversion reading event attr from file header
perf tools: Add missing 'node' alias to the hw_cache[] array
perf probe: Support adding probes on offline kernel modules
perf probe: Add probed module in front of function
perf probe: Introduce debuginfo to encapsulate dwarf information
perf-probe: Move dwarf library routines to dwarf-aux.{c, h}
perf probe: Remove redundant dwarf functions
perf probe: Move strtailcmp to string.c
perf probe: Rename DIE_FIND_CB_FOUND to DIE_FIND_CB_END
tracing/kprobe: Update symbol reference when loading module
tracing/kprobes: Support module init function probing
kprobes: Return -ENOENT if probe point doesn't exist
...
On 32bit platforms that support >= 4GB memory total_ram was truncated.
This creates a confusing printk:
Top of RAM: 0x100000000, Total RAM: 0x0
Fix that:
Top of RAM: 0x100000000, Total RAM: 0x100000000
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Acked-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Callback based iteration is cumbersome and much less useful than
for_each_*() iterator. This patch implements for_each_mem_pfn_range()
which replaces work_with_active_regions(). All the current users of
work_with_active_regions() are converted.
This simplifies walking over early_node_map and will allow converting
internal logics in page_alloc to use iterator instead of walking
early_node_map directly, which in turn will enable moving node
information to memblock.
powerpc change is only compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20110714074610.GD3455@htj.dyndns.org
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The 44x code (which is shared by 47x) assumes the available physical memory
begins at 0x00000000. This is not necessarily the case in an AMP
environment.
Support CONFIG_RELOCATABLE for 476 in order to allow the kernel to be
loaded into a higher memory range.
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Since other OS's may be running on the other cores don't use tlbivax
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
This adds support for running KVM guests in supervisor mode on those
PPC970 processors that have a usable hypervisor mode. Unfortunately,
Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
1), but the YDL PowerStation does have a usable hypervisor mode.
There are several differences between the PPC970 and POWER7 in how
guests are managed. These differences are accommodated using the
CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
bits. Notably, on PPC970:
* The LPCR, LPID or RMOR registers don't exist, and the functions of
those registers are provided by bits in HID4 and one bit in HID0.
* External interrupts can be directed to the hypervisor, but unlike
POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
SRR0/1 not HSRR0/1.
* There is no virtual RMA (VRMA) mode; the guest must use an RMO
(real mode offset) area.
* The TLB entries are not tagged with the LPID, so it is necessary to
flush the whole TLB on partition switch. Furthermore, when switching
partitions we have to ensure that no other CPU is executing the tlbie
or tlbsync instructions in either the old or the new partition,
otherwise undefined behaviour can occur.
* The PMU has 8 counters (PMC registers) rather than 6.
* The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.
* The SLB has 64 entries rather than 32.
* There is no mediated external interrupt facility, so if we switch to
a guest that has a virtual external interrupt pending but the guest
has MSR[EE] = 0, we have to arrange to have an interrupt pending for
it so that we can get control back once it re-enables interrupts. We
do that by sending ourselves an IPI with smp_send_reschedule after
hard-disabling interrupts.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to
indicate that we have a usable hypervisor mode, and another to indicate
that the processor conforms to PowerISA version 2.06. We also add
another bit to indicate that the processor conforms to ISA version 2.01
and set that for PPC970 and derivatives.
Some PPC970 chips (specifically those in Apple machines) have a
hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode
is not useful in the sense that there is no way to run any code in
supervisor mode (HV=0 PR=0). On these processors, the LPES0 and LPES1
bits in HID4 are always 0, and we use that as a way of detecting that
hypervisor mode is not useful.
Where we have a feature section in assembly code around code that
only applies on POWER7 in hypervisor mode, we use a construct like
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
The definition of END_FTR_SECTION_IFSET is such that the code will
be enabled (not overwritten with nops) only if all bits in the
provided mask are set.
Note that the CPU feature check in __tlbie() only needs to check the
ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called
if we are running bare-metal, i.e. in hypervisor mode.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This is used to round-robin TLBCAM entries.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
The nmi parameter indicated if we could do wakeups from the current
context, if not, we would set some state and self-IPI and let the
resulting interrupt do the wakeup.
For the various event classes:
- hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
the PMI-tail (ARM etc.)
- tracepoint: nmi=0; since tracepoint could be from NMI context.
- software: nmi=[0,1]; some, like the schedule thing cannot
perform wakeups, and hence need 0.
As one can see, there is very little nmi=1 usage, and the down-side of
not using it is that on some platforms some software events can have a
jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).
The up-side however is that we can remove the nmi parameter and save a
bunch of conditionals in fast paths.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds a printk companion to replace the udbg progress function
when initmem is freed.
Suggested-by: Milton Miller <miltonm@bga.com>
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Dave Carroll <dcarroll@astekcorp.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The free_initmem function is basically duplicated in mm/init_32,
and init_64, and is moved to the common 32/64-bit mm/mem.c.
All other sections except init were removed in v2.6.15 by
6c45ab992e (powerpc: Remove section
free() and linker script bits), and therefore the bulk of the executed
code is identical.
This patch also removes updating ppc_md.progress to NULL in the powermac
late_initcall.
Suggested-by: Milton Miller <miltonm@bga.com>
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Dave Carroll <dcarroll@astekcorp.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This has been broken for a while but hasn't been an issue until
now because nobody was reserving regions at high addresses.
Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On MMUs such as FSL where we can guarantee the entire linear mapping is
bolted, we don't need to worry about linear TLB misses. If on top of
that we do a full table walk, we get rid of all recursive TLB faults, and
can dispense with some state saving. This gains a few percent on
TLB-miss-heavy workloads, and around 50% on a benchmark that had a high
rate of virtual page table faults under the normal handler.
While touching the EX_TLB layout, remove EX_TLB_MMUCR0, EX_TLB_SRR0, and
EX_TLB_SRR1 as they're not used.
[BenH: Fixed build with 64K pages (wsp config)]
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since printk_ratelimit() shouldn't be used anymore (see comment in
include/linux/printk.h), replace it with printk_ratelimited.
Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Before if we didn't support or enable HW table walk we'd get a messaage
like:
MMU: Book3E Page Tables Disabled
Which is a bit misleading. Now it will say:
MMU: Book3E HW tablewalk not supported
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When using 64K pages with a separate cpio rootfs, U-Boot will align
the rootfs on a 4K page boundary. When the memory is reserved, and
subsequent early memblock_alloc is called, it will allocate memory
between the 64K page alignment and reserved memory. When the reserved
memory is subsequently freed, it is done so by pages, causing the
early memblock_alloc requests to be re-used, which in my case, caused
the device-tree to be clobbered.
This patch forces the reserved memory for initrd to be kernel page
aligned, and will move the device tree if it overlaps with the range
extension of initrd. This patch will also consolidate the identical
function free_initrd_mem() from mm/init_32.c, init_64.c to mm/mem.c,
and adds the same range extension when freeing initrd. free_initrd_mem()
is also moved to the __init section.
Many thanks to Milton Miller for his input on this patch.
[BenH: Fixed build without CONFIG_BLK_DEV_INITRD]
Signed-off-by: Dave Carroll <dcarroll@astekcorp.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
In case other architectures require RCU freed page-tables to implement
gup_fast() and software filled hashes and similar things, provide the
means to do so by moving the logic into generic code.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Requested-by: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix up powerpc to the new mmu_gather stuff.
PPC has an extra batching queue to RCU free the actual pagetable
allocations, use the ARCH extentions for that for now.
For the ppc64_tlb_batch, which tracks the vaddrs to unhash from the
hardware hash-table, keep using per-cpu arrays but flush on context switch
and use a TLF bit to track the lazy_mmu state.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Miller <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have a confusing number of ioremap functions. Make things just a
bit simpler by merging ioremap_flags and ioremap_prot.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add ioremap_wc so drivers can request write combining on kernel
mappings.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Due to a collision between NO_CONTEXT->MMU_NO_CONTEXT change and
Anton's patch.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Adapt new API.
Almost change is trivial. Most important change is the below line
because we plan to change task->cpus_allowed implementation.
- ctx->cpus_allowed = current->cpus_allowed;
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Icswx is a PowerPC instruction to send data to a co-processor. On Book-S
processors the LPAR_ID and process ID (PID) of the owning process are
registered in the window context of the co-processor at initialization
time. When the icswx instruction is executed the L2 generates a cop-reg
transaction on PowerBus. The transaction has no address and the
processor does not perform an MMU access to authenticate the transaction.
The co-processor compares the LPAR_ID and the PID included in the
transaction and the LPAR_ID and PID held in the window context to
determine if the process is authorized to generate the transaction.
The OS needs to assign a 16-bit PID for the process. This cop-PID needs
to be updated during context switch. The cop-PID needs to be destroyed
when the context is destroyed.
Signed-off-by: Sonny Rao <sonnyrao@linux.vnet.ibm.com>
Signed-off-by: Tseng-Hui (Frank) Lin <thlin@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This removes MMU_FTR_TLBIE_206 as we can now use CPU_FTR_HVMODE_206. It
also changes the logic to select which tlbie to use to be based on this
new CPU feature bit.
This also duplicates the ASM_FTR_IF/SET/CLR defines for CPU features
(copied from MMU features).
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some of the 64bit PPC CPU features are MMU-related, so this patch moves
them to MMU_FTR_ bits. All cpu_has_feature()-style tests are moved to
mmu_has_feature(), and seven feature bits are freed as a result.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we don't find ibm,associativity-reference-points as a child of
/rtas, look for it at the root of the tree instead. We use this on
Book3E where we have no RTAS but still use the sPAPR conventions
for NUMA.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There are a few places we patch instructions without using
patch_instruction and patch_branch, probably because they
predated it. Fix it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we allocate the stale_map for a cpu when it comes online,
this leaves open a small window where a process can be scheduled
on the cpu before the stale_map is allocated. Instead allocate
the stale_map at CPU_UP_PREPARE time, that way it will be always
available before tasks start running.
It is possible the cpu fails to come up, in which case we should free
the stale_map, so add a CPU_UP_CANCELED case to do that.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use MMU_NO_CONTEXT as the initialiser for mm_context.id on
nohash and hash64.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fix some minor typos:
* informations => information
* there own => their own
* these => this
Signed-off-by: Sylvestre Ledru <sylvestre.ledru@scilab.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is used by Alsa to mmap buffers allocated with dma_alloc_coherent()
into userspace. We need a special variant to handle machines with
non-coherent DMAs as those buffers have "special" virt addresses and
require non-cachable mappings
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This feature triggers nasty races in the scheduler between the
rebuilding of the topology and the load balancing code, causing
the machine to hang.
Disable it for now until the races are fixed.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
memblock_enforce_memory_limit() takes the desired maximum quantity of memory
to end up with, not an address above which memory will not be used.
Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
hpte_need_flush() might be called outside of a preempt section
when manipulating the kernel page tables, so we need to use the
appopriate variants of per-cpu variable accesses. There should
be no risk of being in the middle of a batch and a context
switch will flush any pending batch.
[Patch extracted from a larger patch in Peter's preemptible
mmu_gather series]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When converting to the new cpumask code I screwed up:
- if (cpu_isset(cpu, numa_cpumask_lookup_table[node])) {
- cpu_clear(cpu, numa_cpumask_lookup_table[node]);
+ if (cpumask_test_cpu(cpu, node_to_cpumask_map[node])) {
+ cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
This was introduced in commit 25863de07a (powerpc/cpumask: Convert NUMA code
to new cpumask API)
Fix it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>