Commit Graph

705 Commits

Author SHA1 Message Date
Benjamin Herrenschmidt 1ed31d6db9 Merge commit 'origin/master' into next 2010-05-07 11:29:25 +10:00
Benjamin Herrenschmidt 2ef613cb94 powerpc/cpumask: Convert mpic driver to new cpumask API
Convert to the new cpumask API.

irq_choose_cpu can be simplified by using cpumask_next and cpumask_first.

smp_mpic_message_pass was doing open coded cpumask manipulation and passing an
int for a cpumask into mpic_send_ipi. Since mpic_send_ipi is only used
locally, make it static and convert it to take a cpumask. This allows us
to clean up the mess in smp_mpic_message_pass.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-06 18:01:46 +10:00
Anton Blanchard 25863de07a powerpc/cpumask: Convert NUMA code to new cpumask API
Convert NUMA code to new cpumask API. We shift the node to cpumask
setup code until after we complete bootmem allocation so we can
dynamically allocate the cpumasks.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-06 17:41:58 +10:00
Anton Blanchard cc1ba8ea6d powerpc/cpumask: Dynamically allocate cpu_sibling_map and cpu_core_map cpumasks
Dynamically allocate cpu_sibling_map and cpu_core_map cpumasks.

We don't need to set_cpu_online() the boot cpu in smp_prepare_boot_cpu,
init/main.c does it for us.

We also postpone setting of the boot cpu in cpu_sibling_map and cpu_core_map
until when the memory allocator is available (smp_prepare_cpus), similar
to x86.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-06 17:41:56 +10:00
Anton Blanchard b6decb7079 powerpc/cpumask: Convert fixup_irqs to new cpumask API
Use new cpumask_* functions, and dynamically allocate cpumask in fixup_irqs.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-06 17:16:14 +10:00
Mark Nelson 91eea67c6d powerpc/mm: Track backing pages allocated by vmemmap_populate()
We need to keep track of the backing pages that get allocated by
vmemmap_populate() so that when we use kdump, the dump-capture kernel knows
where these pages are.

We use a simple linked list of structures that contain the physical address
of the backing page and corresponding virtual address to track the backing
pages.
To save space, we just use a pointer to the next struct vmemmap_backing. We
can also do this because we never remove nodes.  We call the pointer "list"
to be compatible with changes made to the crash utility.

vmemmap_populate() is called either at boot-time or on a memory hotplug
operation. We don't have to worry about the boot-time calls because they
will be inherently single-threaded, and for a memory hotplug operation
vmemmap_populate() is called through:
sparse_add_one_section()
            |
            V
kmalloc_section_memmap()
            |
            V
sparse_mem_map_populate()
            |
            V
vmemmap_populate()
and in sparse_add_one_section() we're protected by pgdat_resize_lock().
So, we don't need a spinlock to protect the vmemmap_list.

We allocate space for the vmemmap_backing structs by allocating whole pages
in vmemmap_list_alloc() and then handing out chunks of this to
vmemmap_list_populate().

This means that we waste at most just under one page, but this keeps the code
is simple.

Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-06 16:49:27 +10:00
Martyn Welch 7cad197849 powerpc: Correct parport interrupt parsing
Currently the parsing of the device tree in
arch/powerpc/include/asm/parport.h assumes that the interrupt provided in
the parallel port node is a valid virtual irq. The values for the
interrupts provided in the device tree should have meaning in the context
of the driver for the specific interrupt controller to which the interrupt
is connected and irq_of_parse_and_map() should be used to determine the
correct virtual irq.

Signed-off-by: Martyn Welch <martyn.welch@ge.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-05-06 16:49:26 +10:00
Torez Smith b4e8c8dd84 powerpc/4xx: Simple platform for the ISS 4xx simulator
This is a trivial 4xx plaform that uses the new simple bsp from
Josh and is handy to use in simulators such as ISS or even Mambo
who don't properly implement most of the actual devices in the
SoC but really only the core.

Signed-off-by: Torez Smith  <lnxtorez@linux.vnet.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
2010-05-05 11:11:56 -04:00
Dave Kleikamp fc5e709731 powerpc/476: add machine check handler for 47x core
The 47x core's MCSR varies from 44x, so it needs it's own machine check
handler.

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
2010-05-05 09:27:22 -04:00
Dave Kleikamp e7f75ad01d powerpc/47x: Base ppc476 support
This patch adds the base support for the 476 processor.  The code was
primarily written by Ben Herrenschmidt and Torez Smith, but I've been
maintaining it for a while.

The goal is to have a single binary that will run on 44x and 47x, but
we still have some details to work out.  The biggest is that the L1 cache
line size differs on the two platforms, but it's currently a compile-time
option.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Torez Smith  <lnxtorez@linux.vnet.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
2010-05-05 09:11:10 -04:00
Kumar Gala dbc9632a8c powerpc/fsl-booke: Fix CONFIG_RELOCATABLE support on FSL Book-E ppc32
The following commit broke CONFIG_RELOCATABLE support on FSL Book-E
parts:

commit 549e8152de
Author: Paul Mackerras <paulus@samba.org>
Date:   Sat Aug 30 11:43:47 2008 +1000

    powerpc: Make the 64-bit kernel as a position-independent executable

The change to __va and __pa to use PAGE_OFFSET & MEMORY_START causes
problems on the Book-E parts because we don't know MEMORY_START until
after we parse the device tree.  We need __va to work properly to even
parse the device tree so we have a chicken an egg.  So go back to using
he other definition of __va/__pa on CONFIG_BOOKE and use the
PAGE_OFFSET/MEMORY_START version on "Classic" PPC64.

Also updated casts to handle phys_addr_t being a different size from
unsigned long (ie 36-bit physical on PPC32).

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-04-26 17:54:15 -05:00
Benjamin Herrenschmidt cb694769f0 Revert "powerpc/mm: Bump SECTION_SIZE_BITS from 16MB to 256MB"
This reverts commit 7545ba6f82.

It breaks eHEA among other issues
2010-04-13 13:54:39 +10:00
Mahesh Salgaonkar 359e4284a3 powerpc: Add kprobe-based event tracer
This patch ports the kprobe-based event tracer to powerpc. This patch
is based on x86 port. This brings powerpc on par with x86.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-04-07 18:11:43 +10:00
Anton Blanchard 7545ba6f82 powerpc/mm: Bump SECTION_SIZE_BITS from 16MB to 256MB
The current setting for SECTION_SIZE_BITS is quite small compared to
everyone else:

arch/powerpc/include/asm/sparsemem.h:#define SECTION_SIZE_BITS  24

arch/sparc/include/asm/sparsemem.h:#define SECTION_SIZE_BITS    30
arch/ia64/include/asm/sparsemem.h:#define SECTION_SIZE_BITS     (30)
arch/s390/include/asm/sparsemem.h:#define SECTION_SIZE_BITS     28
arch/x86/include/asm/sparsemem.h:# define SECTION_SIZE_BITS     27

And it has proven to be an issue during boot on very large machines.
If hotplug memory is enabled, drivers/base/memory.c does this:

       for (i = 0; i < NR_MEM_SECTIONS; i++) {
                if (!present_section_nr(i))
                        continue;
                err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE,
                                        0, BOOT);
                if (!ret)
                        ret = err;
        }

Which creates a sysfs directory for every 16MB of memory. As a result
I'm seeing up to 30 minutes spent here during boot:

c000000000248ee0 .__sysfs_add_one+0x28/0x128
c0000000002492a8 .sysfs_add_one+0x38/0x188
c000000000249c88 .create_dir+0x70/0x138
c000000000249d98 .sysfs_create_dir+0x48/0x78
c00000000032bad8 .kobject_add_internal+0x140/0x308
c00000000032beb4 .kobject_init_and_add+0x4c/0x68
c00000000046c2c0 .sysdev_register+0xa0/0x220
c00000000047b1dc .add_memory_block+0x124/0x1e8
c0000000008d1f28 .memory_dev_init+0xf4/0x168
c0000000008d1b64 .driver_init+0x50/0x64
c000000000890378 .do_basic_setup+0x40/0xd4

I assume there are some O(n^2) issues in sysfs as we add all the memory
nodes. Bumping SECTION_SIZE_BITS to 256 MB drops the time to about 10
seconds and results in a much smaller /sys.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-04-07 18:00:49 +10:00
Anton Blanchard 27f10907b7 powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
enabled via this:

        /*
         * If another node is sufficiently far away then it is better
         * to reclaim pages in a zone before going off node.
         */
        if (distance > RECLAIM_DISTANCE)
                zone_reclaim_mode = 1;

Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
RECLAIM_DISTANCE it never kicks in.

The local to remote bandwidth ratios can be quite large on System p
machines so it makes sense for us to reclaim clean pagecache locally before
going off node.

The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
zone reclaim.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-04-07 18:00:47 +10:00
Jason Gunthorpe 43b5fefc24 powerpc/ppc32: Fixup pmd_page to work when ARCH_PFN_OFFSET is non-zero
Instead of referencing mem_map directly, use pfn_to_page. Otherwise
the kernel crashes when trying to start userspace if ARCH_PFN_OFFSET is
non-zero and CONFIG_BOOKE is not defined

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-04-07 18:00:30 +10:00
Vaidyanathan Srinivasan 6fe9d1facb powerpc/pseries: Export data from new hcall H_EM_GET_PARMS
Add support for H_EM_GET_PARMS hcall that will return data
related to power modes from the platform.  Export the data
directly to user space for administrative tools to interpret
and use.

cat /proc/powerpc/lparcfg will export power mode data

Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-04-07 18:00:29 +10:00
Linus Torvalds 6fa41366c1 Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  powerpc/perf_events: Fix call-graph recording, add perf_arch_fetch_caller_regs
  perf top: Add missing initialization to zero
  perf probe: Use original address instead of CU-based address
  perf probe: Fix offset to allow signed value
  perf top: Improve the autosizing of column lenghts
  perf probe: Fix need_dwarf flag if lazy matching is used
  perf probe: Fix probe_point buffer overrun
2010-03-26 15:09:33 -07:00
Nathan Lynch 409d241b7b powerpc: Use correct ccr bit for syscall error status
The powerpc implementations of syscall_get_error and
syscall_set_return_value should use CCR0:S0 (0x10000000) for testing
and setting syscall error status.  Fortunately these APIs don't seem
to be used at the moment.

Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-03-19 16:38:16 +11:00
Benjamin Herrenschmidt d6a8536a93 Merge commit 'kumar/merge' into merge 2010-03-19 16:23:55 +11:00
Paul Mackerras 9eff26ea48 powerpc/perf_events: Fix call-graph recording, add perf_arch_fetch_caller_regs
This implements a powerpc version of perf_arch_fetch_caller_regs
to get correct call-graphs.

It's implemented in assembly because that way we can be sure there isn't
a stack frame for perf_arch_fetch_caller_regs.  If it was in C, gcc might
or might not create a stack frame for it, which would affect the number
of levels we have to skip.

With this, we see results from perf record -e lock:lock_acquire like
this:

 # Samples: 24878
 #
 # Overhead         Command      Shared Object  Symbol
 # ........  ..............  .................  ......
 #
    14.99%            perf  [kernel.kallsyms]  [k] ._raw_spin_lock
                      |
                      --- ._raw_spin_lock
                         |
                         |--25.00%-- .alloc_fd
                         |          (nil)
                         |          |
                         |          |--50.00%-- .anon_inode_getfd
                         |          |          .sys_perf_event_open
                         |          |          syscall_exit
                         |          |          syscall
                         |          |          create_counter
                         |          |          __cmd_record
                         |          |          run_builtin
                         |          |          main
                         |          |          0xfd2e704
                         |          |          0xfd2e8c0
                         |          |          (nil)

... etc.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: anton@samba.org
Cc: linuxppc-dev@ozlabs.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100318050513.GA6575@drongo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-03-18 06:48:29 +01:00
Kumar Gala d6ccb1f55d powerpc/85xx: Make sure lwarx hint isn't set on ppc32
e500v1/v2 based chips will treat any reserved field being set in an
opcode as illegal.  Thus always setting the hint in the opcode is
a bad idea.

Anton should be kept away from the powerpc opcode map.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-03-16 23:24:06 -05:00
Linus Torvalds b6fedfd2a1 Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/booke: Fix breakpoint/watchpoint one-shot behavior
  powerpc: Reduce printk from pseries_mach_cpu_die()
  powerpc: Move checks in pseries_mach_cpu_die()
  powerpc: Reset kernel stack on cpu online from cede state
  powerpc: Fix G5 thermal shutdown
  powerpc/pseries: Pass CPPR value to H_XIRR hcall
  powerpc/booke: Fix a couple typos in the advanced ptrace code
  powerpc: Fix SMP build with disabled CPU hotplugging.
  powerpc: Dynamically allocate pacas
  powerpc/perf: e500 support
  powerpc/perf: Build callchain code regardless of hardware event support.
  powerpc/cpm2: Checkpatch cleanup
  powerpc/86xx: Renaming following split of GE Fanuc joint venture
  powerpc/86xx: Convert gef_pic_lock to raw_spinlock
  powerpc/qe: Convert qe_ic_lock to raw_spinlock
  powerpc/82xx: Convert pci_pic_lock to raw_spinlock
  powerpc/85xx: Convert socrates_fpga_pic_lock to raw_spinlock
2010-03-12 16:06:51 -08:00
FUJITA Tomonori 6e6c70e691 dma-mapping: powerpc: use generic pci_set_dma_mask and pci_set_consistent_dma_mask
This converts powerpc to use the generic pci_set_dma_mask and
pci_set_consistent_dma_mask (drivers/pci/pci.c).

The generic pci_set_dma_mask does what powerpc's pci_set_dma_mask does.

Unlike powerpc's pci_set_consistent_dma_mask, the gneric
pci_set_consistent_dma_mask sets only coherent_dma_mask.  It doesn't work
for powerpc?  pci_set_consistent_dma_mask API should set only
coherent_dma_mask?

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Greg KH <greg@kroah.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:42 -08:00
FUJITA Tomonori f41b177157 pci-dma: add linux/pci-dma.h to linux/pci.h
All the architectures properly set NEED_DMA_MAP_STATE now so we can safely
add linux/pci-dma.h to linux/pci.h and remove the linux/pci-dma.h
inclusion in arch's asm/pci.h

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:42 -08:00
FUJITA Tomonori af407c6db1 pci-dma: powerpc: use include/linux/pci-dma.h
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:41 -08:00
Christoph Hellwig dacbe41f77 ptrace: move user_enable_single_step & co prototypes to linux/ptrace.h
While in theory user_enable_single_step/user_disable_single_step/
user_enable_blockstep could also be provided as an inline or macro there's
no good reason to do so, and having the prototype in one places keeps code
size and confusion down.

Roland said:

  The original thought there was that user_enable_single_step() et al
  might well be only an instruction or three on a sane machine (as if we
  have any of those!), and since there is only one call site inlining
  would be beneficial.  But I agree that there is no strong reason to care
  about inlining it.

  As to the arch changes, there is only one thought I'd add to the
  record.  It was always my thinking that for an arch where
  PTRACE_SINGLESTEP does text-modifying breakpoint insertion,
  user_enable_single_step() should not be provided.  That is,
  arch_has_single_step()=>true means that there is an arch facility with
  "pure" semantics that does not have any unexpected side effects.
  Inserting a breakpoint might do very unexpected strange things in
  multi-threaded situations.  Aside from that, it is a peculiar side
  effect that user_{enable,disable}_single_step() should cause COW
  de-sharing of text pages and so forth.  For PTRACE_SINGLESTEP, all these
  peculiarities are the status quo ante for that arch, so having
  arch_ptrace() itself do those is one thing.  But for building other
  things in the future, it is nicer to have a uniform "pure" semantics
  that arch-independent code can expect.

  OTOH, all such arch issues are really up to the arch maintainer.  As
  of today, there is nothing but ptrace using user_enable_single_step() et
  al so it's a distinction without a practical difference.  If/when there
  are other facilities that use user_enable_single_step() and might care,
  the affected arch's can revisit the question when someone cares about
  the quality of the arch support for said new facility.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:38 -08:00
Christoph Hellwig 5cacdb4add Add generic sys_olduname()
Add generic implementations of the old and really old uname system calls.
Note that sh only implements sys_olduname but not sys_oldolduname, but I'm
not going to bother with another ifdef for that special case.

m32r implemented an old uname but never wired it up, so kill it, too.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:32 -08:00
Christoph Hellwig e28cbf2293 improve sys_newuname() for compat architectures
On an architecture that supports 32-bit compat we need to override the
reported machine in uname with the 32-bit value.  Instead of doing this
separately in every architecture introduce a COMPAT_UTS_MACHINE define in
<asm/compat.h> and apply it directly in sys_newuname().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:32 -08:00
Christoph Hellwig baed7fc9b5 Add generic sys_ipc wrapper
Add a generic implementation of the ipc demultiplexer syscall.  Except for
s390 and sparc64 all implementations of the sys_ipc are nearly identical.

There are slight differences in the types of the parameters, where mips
and powerpc as the only 64-bit architectures with sys_ipc use unsigned
long for the "third" argument as it gets casted to a pointer later, while
it traditionally is an "int" like most other paramters.  frv goes even
further and uses unsigned long for all parameters execept for "ptr" which
is a pointer type everywhere.  The change from int to unsigned long for
"third" and back to "int" for the others on frv should be fine due to the
in-register calling conventions for syscalls (we already had a similar
issue with the generic sys_ptrace), but I'd prefer to have the arch
maintainers looks over this in details.

Except for that h8300, m68k and m68knommu lack an impplementation of the
semtimedop sub call which this patch adds, and various architectures have
gets used - at least on i386 it seems superflous as the compat code on
x86-64 and ia64 doesn't even bother to implement it.

[akpm@linux-foundation.org: add sys_ipc to sys_ni.c]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-12 15:52:32 -08:00
Dave Kleikamp 856f70a368 powerpc/booke: Fix a couple typos in the advanced ptrace code
powerpc/booke: Fix a couple typos in the advanced ptrace code

Found and fixed a couple typos in the advanced ptrace patches.
(These patches are currently in benh's next tree.)

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev list <Linuxppc-dev@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-03-09 11:54:18 +11:00
Michael Ellerman 1426d5a3bd powerpc: Dynamically allocate pacas
On 64-bit kernels we currently have a 512 byte struct paca_struct for
each cpu (usually just called "the paca"). Currently they are statically
allocated, which means a kernel built for a large number of cpus will
waste a lot of space if it's booted on a machine with few cpus.

We can avoid that by only allocating the number of pacas we need at
boot. However this is complicated by the fact that we need to access
the paca before we know how many cpus there are in the system.

The solution is to dynamically allocate enough space for NR_CPUS pacas,
but then later in boot when we know how many cpus we have, we free any
unused pacas.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-03-09 11:52:52 +11:00
Benjamin Herrenschmidt 59603b9ae4 Merge commit 'kumar/next' into merge 2010-03-09 11:51:57 +11:00
Linus Torvalds c812a51d11 Merge branch 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (145 commits)
  KVM: x86: Add KVM_CAP_X86_ROBUST_SINGLESTEP
  KVM: VMX: Update instruction length on intercepted BP
  KVM: Fix emulate_sys[call, enter, exit]()'s fault handling
  KVM: Fix segment descriptor loading
  KVM: Fix load_guest_segment_descriptor() to inject page fault
  KVM: x86 emulator: Forbid modifying CS segment register by mov instruction
  KVM: Convert kvm->requests_lock to raw_spinlock_t
  KVM: Convert i8254/i8259 locks to raw_spinlocks
  KVM: x86 emulator: disallow opcode 82 in 64-bit mode
  KVM: x86 emulator: code style cleanup
  KVM: Plan obsolescence of kernel allocated slots, paravirt mmu
  KVM: x86 emulator: Add LOCK prefix validity checking
  KVM: x86 emulator: Check CPL level during privilege instruction emulation
  KVM: x86 emulator: Fix popf emulation
  KVM: x86 emulator: Check IOPL level during io instruction emulation
  KVM: x86 emulator: fix memory access during x86 emulation
  KVM: x86 emulator: Add Virtual-8086 mode of emulation
  KVM: x86 emulator: Add group9 instruction decoding
  KVM: x86 emulator: Add group8 instruction decoding
  KVM: do not store wqh in irqfd
  ...

Trivial conflicts in Documentation/feature-removal-schedule.txt
2010-03-05 13:12:34 -08:00
Scott Wood a11106544f powerpc/perf: e500 support
This implements perf_event support for the Freescale embedded performance
monitor, based on the existing perf_event.c that supports server/classic
chips.

Some limitations:
- Performance monitor interrupts are regular EE interrupts, and thus you
  can't profile places with interrupts disabled.  We may want to implement
  soft IRQ-disabling, with perfmon interrupts exempted and treated as NMIs.
- When trying to schedule multiple event groups at once, and using
  restricted events, situations could arise where scheduling fails even
  though it would be possible.  Consider three groups, each with two events.
  One group has restricted events, the others don't.  The two non-restricted
  groups are scheduled, then one is removed, which happens to occupy the two
  counters that can't do restricted events.  The remaining non-restricted
  group will not be moved to the non-restricted-capable counters to make
  room if the restricted group tries to be scheduled.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-03-05 03:04:08 -06:00
Linus Torvalds 0a135ba14d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: add __percpu sparse annotations to what's left
  percpu: add __percpu sparse annotations to fs
  percpu: add __percpu sparse annotations to core kernel subsystems
  local_t: Remove leftover local.h
  this_cpu: Remove pageset_notifier
  this_cpu: Page allocator conversion
  percpu, x86: Generic inc / dec percpu instructions
  local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
  module: Use this_cpu_xx to dynamically allocate counters
  local_t: Remove cpu_local_xx macros
  percpu: refactor the code in pcpu_[de]populate_chunk()
  percpu: remove compile warnings caused by __verify_pcpu_ptr()
  percpu: make accessors check for percpu pointer in sparse
  percpu: add __percpu for sparse.
  percpu: make access macros universal
  percpu: remove per_cpu__ prefix.
2010-03-03 07:34:18 -08:00
Linus Torvalds ac0f6f927d Merge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (100 commits)
  ARM: Eliminate decompressor -Dstatic= PIC hack
  ARM: 5958/1: ARM: U300: fix inverted clk round rate
  ARM: 5956/1: misplaced parentheses
  ARM: 5955/1: ep93xx: move timer defines into core.c and document
  ARM: 5954/1: ep93xx: move gpio interrupt support to gpio.c
  ARM: 5953/1: ep93xx: fix broken build of clock.c
  ARM: 5952/1: ARM: MM: Add ARM_L1_CACHE_SHIFT_6 for handle inside each ARCH Kconfig
  ARM: 5949/1: NUC900 add gpio virtual memory map
  ARM: 5948/1: Enable timer0 to time4 clock support for nuc910
  ARM: 5940/2: ARM: MMCI: remove custom DBG macro and printk
  ARM: make_coherent(): fix problems with highpte, part 2
  MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself
  ARM: 5945/1: ep93xx: include correct irq.h in core.c
  ARM: 5933/1: amba-pl011: support hardware flow control
  ARM: 5930/1: Add PKMAP area description to memory.txt.
  ARM: 5929/1: Add checks to detect overlap of memory regions.
  ARM: 5928/1: Change type of VMALLOC_END to unsigned long.
  ARM: 5927/1: Make delimiters of DMA area globally visibly.
  ARM: 5926/1: Add "Virtual kernel memory..." printout.
  ARM: 5920/1: OMAP4: Enable L2 Cache
  ...

Fix up trivial conflict in arch/arm/mach-mx25/clock.c
2010-03-01 09:15:15 -08:00
Liu Yu daf5e27109 KVM: ppc/booke: Set ESR and DEAR when inject interrupt to guest
Old method prematurely sets ESR and DEAR.
Move this part after we decide to inject interrupt,
which is more like hardware behave.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Acked-by: Hollis Blanchard <hollis@penguinppc.org>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:36:10 -03:00
Liu Yu da15bf436b KVM: PPC E500: fix tlbcfg emulation
commit 55fb1027c1cf9797dbdeab48180da530e81b1c39 doesn't update tlbcfg correctly.
Fix it.

And since guest OS likes 'fixed' hardware,
initialize tlbcfg everytime when guest access is useless.
So move this part to init code.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:06 -03:00
Liu Yu d86be077a4 KVM: PPC E500: Add register l1csr0 emulation
Latest kernel start to access l1csr0 to contron L1.
We just tell guest no operation is on going.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-03-01 12:36:05 -03:00
Alexander Graf f7adbba1e5 KVM: PPC: Keep SRR1 flags around in shadow_msr
SRR1 stores more information that just the MSR value. It also stores
valuable information about the type of interrupt we received, for
example whether the storage interrupt we just got was because of a
missing htab entry or not.

We use that information to speed up the exit path.

Now if we get preempted before we can interpret the shadow_msr values,
we get into vcpu_put which then calls the MSR handler, which then sets
all the SRR1 information bits in shadow_msr to 0. Great.

So let's preserve the SRR1 specific bits in shadow_msr whenever we set
the MSR. They don't hurt.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:56 -03:00
Alexander Graf 1c0006d8d1 KVM: PPC: Fix initial GPR settings
Commit 7d01b4c3ed2bb33ceaf2d270cb4831a67a76b51b introduced PACA backed vcpu
values. With this patch, when a userspace app was setting GPRs before it was
actually first loaded, the set values get discarded.

This is because vcpu_load loads them from the vcpu backing store that we use
whenever we're not owning the PACA.

That behavior is not really a major problem, because we don't need it for
qemu. Other users (like kvmctl) do have problems with it though, so let's
better do it right.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:55 -03:00
Alexander Graf 180a34d2d3 KVM: PPC: Add support for FPU/Altivec/VSX
When our guest starts using either the FPU, Altivec or VSX we need to make
sure Linux knows about it and sneak into its process switching code
accordingly.

This patch makes accesses to the above parts of the system work inside the
VM.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:52 -03:00
Alexander Graf d5e528136c KVM: PPC: Add helper functions to call real mode loaders
Linux contains quite some bits of code to load FPU, Altivec and VSX lazily for
a task. It calls those bits in real mode, coming from an interrupt handler.

For KVM we better reuse those, so let's wrap a bit of trampoline magic around
them and then we can call them from normal module code.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:52 -03:00
Alexander Graf 4b5c9b7f9b KVM: PPC: Make large pages work
An SLB entry contains two pieces of information related to size:

  1) PTE size
  2) SLB size

The L bit defines the PTE be "large" (usually means 16MB),
SLB_VSID_B_1T defines that the SLB should span 1 GB instead of the
default 256MB.

Apparently I messed things up and just put those two in one box,
shaked it heavily and came up with the current code which handles
large pages incorrectly, because it also treats large page SLB entries
as "1TB" segment entries.

This patch splits those two features apart, making Linux guests boot
even when they have > 256MB.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:50 -03:00
Alexander Graf 25a8a02d26 KVM: PPC: Emulate trap SRR1 flags properly
Book3S needs some flags in SRR1 to get to know details about an interrupt.

One such example is the trap instruction. It tells the guest kernel that
a program interrupt is due to a trap using a bit in SRR1.

This patch implements above behavior, making WARN_ON behave like WARN_ON.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:49 -03:00
Alexander Graf 021ec9c69f KVM: PPC: Call SLB patching code in interrupt safe manner
Currently we're racy when doing the transition from IR=1 to IR=0, from
the module memory entry code to the real mode SLB switching code.

To work around that I took a look at the RTAS entry code which is faced
with a similar problem and did the same thing:

  A small helper in linear mapped memory that does mtmsr with IR=0 and
  then RFIs info the actual handler.

Thanks to that trick we can safely take page faults in the entry code
and only need to be really wary of what to do as of the SLB switching
part.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:49 -03:00
Alexander Graf b4433a7cce KVM: PPC: Implement 'skip instruction' mode
To fetch the last instruction we were interrupted on, we enable DR in early
exit code, where we are still in a very transitional phase between guest
and host state.

Most of the time this seemed to work, but another CPU can easily flush our
TLB and HTAB which makes us go in the Linux page fault handler which totally
breaks because we still use the guest's SLB entries.

To work around that, let's introduce a second KVM guest mode that defines
that whenever we get a trap, we don't call the Linux handler or go into
the KVM exit code, but just jump over the faulting instruction.

That way a potentially bad lwz doesn't trigger any faults and we can later
on interpret the invalid instruction we fetched as "fetch didn't work".

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:48 -03:00
Alexander Graf 7e57cba060 KVM: PPC: Use PACA backed shadow vcpu
We're being horribly racy right now. All the entry and exit code hijacks
random fields from the PACA that could easily be used by different code in
case we get interrupted, for example by a #MC or even page fault.

After discussing this with Ben, we figured it's best to reserve some more
space in the PACA and just shove off some vcpu state to there.

That way we can drastically improve the readability of the code, make it
less racy and less complex.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:48 -03:00
Alexander Graf 992b5b29b5 KVM: PPC: Add helpers for CR, XER
We now have helpers for the GPRs, so let's also add some for CR and XER.

Having them in the PACA simplifies code a lot, as we don't need to care
about where to store CC or not to overflow any integers.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-03-01 12:35:47 -03:00