If we want to stop the tick further idle, we need to be
able to account the cputime without using the tick.
Virtual based cputime accounting solves that problem by
hooking into kernel/user boundaries.
However implementing CONFIG_VIRT_CPU_ACCOUNTING require
low level hooks and involves more overhead. But we already
have a generic context tracking subsystem that is required
for RCU needs by archs which plan to shut down the tick
outside idle.
This patch implements a generic virtual based cputime
accounting that relies on these generic kernel/user hooks.
There are some upsides of doing this:
- This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
if context tracking is already built (already necessary for RCU in full
tickless mode).
- We can rely on the generic context tracking subsystem to dynamically
(de)activate the hooks, so that we can switch anytime between virtual
and tick based accounting. This way we don't have the overhead
of the virtual accounting when the tick is running periodically.
And one downside:
- There is probably more overhead than a native virtual based cputime
accounting. But this relies on hooks that are already set anyway.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Zhong <zhong@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
The DDW code uses a eeh_dev struct from the pci_dev. However, this is
not set until eeh_add_device_late is called.
Since pci_bus_add_devices is called before eeh_add_device_late, the PCI
devices are added to the bus, making drivers' probe hooks to be called.
These will call set_dma_mask, which will call the DDW code, which will
require the eeh_dev struct from pci_dev. This would result in a crash,
due to a NULL dereference.
Calling eeh_add_device_late after pci_bus_add_devices would make the
system BUG, because device files shouldn't be added to devices there
were not added to the system. So, a new function is needed to add such
files only after pci_bus_add_devices have been called.
Cc: stable@vger.kernel.org
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds DAWR supoprt to the set_break().
It does both bare metal and PAPR versions of setting the DAWR.
There is still some work we can do to make full use of the watchpoint but that
will come later.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These are just wrappers around the new set_mode HCALL.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For PR KVM we allow userspace to map 0xc000000000000000. Because
transitioning from userspace to the guest kernel may use the relocated
exception vectors we have to disable relocation on exceptions whenever
PR KVM is active as we cannot trust that address.
This issue does not apply to HV KVM, since changing from a guest to the
hypervisor will never use the relocated exception vectors.
Currently the hypervisor interface only allows us to toggle relocation
on exceptions on a partition wide scope, so we need to globally disable
relocation on exceptions when the first PR KVM instance is started and
only re-enable them when all PR KVM instances have been destroyed.
It's a bit heavy handed, but until the hypervisor gives us a lightweight
way to toggle relocation on exceptions on a single thread it's only real
option.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Finally remove the two level TOC and build with -mcmodel=medium.
Unfortunately we can't build modules with -mcmodel=medium due to
the tricks the kernel module loader plays with percpu data:
# -mcmodel=medium breaks modules because it uses 32bit offsets from
# the TOC pointer to create pointers where possible. Pointers into the
# percpu data area are created by this method.
#
# The kernel module loader relocates the percpu data section from the
# original location (starting with 0xd...) to somewhere in the base
# kernel percpu data space (starting with 0xc...). We need a full
# 64bit relocation for this to work, hence -mcmodel=large.
On older kernels we fall back to the two level TOC (-mminimal-toc)
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The third argument for of_get_property() is a pointer, hence pass
NULL instead of 0.
Signed-off-by: Tushar Behera <tushar.behera@linaro.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch actually hooks up doorbell interrupts on POWER8:
- Select the PPC_DOORBELL Kconfig option from PPC_PSERIES
- Add the doorbell CPU feature bit to POWER8
- We define a new pSeries_cause_ipi_mux() function that issues a
doorbell interrupt if the recipient is another thread within the same
core as the sender. If the recipient is in a different core it falls
back to using XICS to deliver the IPI as before.
- During pSeries_smp_probe() at boot, we check if doorbell interrupts
are supported. If they are we set the cause_ipi function pointer to
the above mentioned function, otherwise we leave it as whichever XICS
cause_ipi function was determined by xics_smp_probe().
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Tested-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we search for the best_energy hcall using a custom function. Move
this to using the firmware_feature_table.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
cc: Linux PPC dev <linuxppc-dev@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This allows firmware_features_table names to add a '*' at the end so that only
partial strings are matched.
When a '*' is added, only upto the '*' is matched when setting firmware feature
bits.
This is useful for the matching best energy feature.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
cc: Linux PPC dev <linuxppc-dev@ozlabs.org>
Reviewed-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The pseries CPU hotplug code uses cede_processor without properly
synchronizing the SW and HW interrupt enable state. This fixes
it using the same helpers that were written for the idle code.
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
=======================
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
markings need to be removed.
This change removes the use of __devinit, __devexit_p, __devinitdata,
__devinitconst, and __devexit from these drivers.
Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pull powerpc update from Benjamin Herrenschmidt:
"The main highlight is probably some base POWER8 support. There's more
to come such as transactional memory support but that will wait for
the next one.
Overall it's pretty quiet, or rather I've been pretty poor at picking
things up from patchwork and reviewing them this time around and Kumar
no better on the FSL side it seems..."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (73 commits)
powerpc+of: Rename and fix OF reconfig notifier error inject module
powerpc: mpc5200: Add a3m071 board support
powerpc/512x: don't compile any platform DIU code if the DIU is not enabled
powerpc/mpc52xx: use module_platform_driver macro
powerpc+of: Export of_reconfig_notifier_[register,unregister]
powerpc/dma/raidengine: add raidengine device
powerpc/iommu/fsl: Add PAMU bypass enable register to ccsr_guts struct
powerpc/mpc85xx: Change spin table to cached memory
powerpc/fsl-pci: Add PCI controller ATMU PM support
powerpc/86xx: fsl_pcibios_fixup_bus requires CONFIG_PCI
drivers/virt: the Freescale hypervisor driver doesn't need to check MSR[GS]
powerpc/85xx: p1022ds: Use NULL instead of 0 for pointers
powerpc: Disable relocation on exceptions when kexecing
powerpc: Enable relocation on during exceptions at boot
powerpc: Move get_longbusy_msecs into hvcall.h and remove duplicate function
powerpc: Add wrappers to enable/disable relocation on exceptions
powerpc: Add set_mode hcall
powerpc: Setup relocation on exceptions for bare metal systems
powerpc: Move initial mfspr LPCR out of __init_LPCR
powerpc: Add relocation on exception vector handlers
...
* Introduction of device PM QoS flags.
* ACPI device power management update allowing subsystems other than
PCI to use it more easily.
* ACPI device enumeration rework allowing additional kinds of devices
to be enumerated via ACPI. From Mika Westerberg, Adrian Hunter,
Mathias Nyman, Andy Shevchenko, and Rafael J. Wysocki.
* ACPICA update to version 20121018 from Bob Moore and Lv Zheng.
* ACPI memory hotplug update from Wen Congyang and Yasuaki Ishimatsu.
* Introduction of acpi_handle_<level>() messaging macros and ACPI-based CPU
hot-remove support from Toshi Kani.
* ACPI EC updates from Feng Tang.
* cpufreq updates from Viresh Kumar, Fabio Baltieri and others.
* cpuidle changes to quickly notice governor prediction failure from
Youquan Song.
* Support for using multiple cpuidle drivers at the same time and cpuidle
cleanups from Daniel Lezcano.
* devfreq updates from Nishanth Menon and others.
* cpupower update from Thomas Renninger.
* Fixes and small cleanups all over the place.
--
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABAgAGBQJQxxevAAoJEKhOf7ml8uNsHacQAK2xoQozDddPBAaTCf1OWW/G
J4E2qMn+gy4AtFmQ0/xeZVvvylOCn9eu/kv3QJ/bFlVoUsqTgfPwQBjT6HjF1Acn
7yVFdr9H/tn2wi9nW2Gv6Tl2Hr/kFLpo0f5Kd40Z+P8SV5sKX5ktJcVpJ/I/P4Vh
Qw2nWtj7hQktZDERzgG4ZpMbQToNhbLhbKWB9ad3/XQSSA9JkfgvBFgrTEGHcZD5
bwsggjNdOAWNGeDdzRsQSDn0Alld5ILVdSJ5xKimO1O70WvKc7fqA1IdYRIeKL90
yz4bcoYKzl9iktlw8+x5o1U9mrc8TFV5p4+zV+t5Z6pzS/J3kWvnsW4zu9sCrxFv
Wic3SKyiem7s2dxrYyj4ZXAci3GK4ouRTrCLqk7/00tEGdwAQD1ZNfsUJp6jKayz
FvtZUgItcOyrlQ6B4nh951OY6dI3AUYJ2NuWWNr5NZkgVAvQGV8zTGOImbeVeL2+
pMiw14zScO3ylYilVcjTKDDMj2sDZ68mw5PIcbmksvWsCLo26jDBVDtLVmtYWyd4
ek3WnOrQZr0R3agvOLLssMKXompvpP+N4Klf4rV+GtqGsWtHryYKys2Laju9FwFj
yYLchxYlxhGTzqq8LjF90HDL0TWpPe6cPi+B5ow9g/SXLexbMKNQGhv3Jovm2yR3
j54tKBWy7e9AAYEDPirX
=6OEP
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-for-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
- Introduction of device PM QoS flags.
- ACPI device power management update allowing subsystems other than
PCI to use it more easily.
- ACPI device enumeration rework allowing additional kinds of devices
to be enumerated via ACPI. From Mika Westerberg, Adrian Hunter,
Mathias Nyman, Andy Shevchenko, and Rafael J. Wysocki.
- ACPICA update to version 20121018 from Bob Moore and Lv Zheng.
- ACPI memory hotplug update from Wen Congyang and Yasuaki Ishimatsu.
- Introduction of acpi_handle_<level>() messaging macros and ACPI-based
CPU hot-remove support from Toshi Kani.
- ACPI EC updates from Feng Tang.
- cpufreq updates from Viresh Kumar, Fabio Baltieri and others.
- cpuidle changes to quickly notice governor prediction failure from
Youquan Song.
- Support for using multiple cpuidle drivers at the same time and
cpuidle cleanups from Daniel Lezcano.
- devfreq updates from Nishanth Menon and others.
- cpupower update from Thomas Renninger.
- Fixes and small cleanups all over the place.
* tag 'pm+acpi-for-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (196 commits)
mmc: sdhci-acpi: enable runtime-pm for device HID INT33C6
ACPI: add Haswell LPSS devices to acpi_platform_device_ids list
ACPI: add documentation about ACPI 5 enumeration
pnpacpi: fix incorrect TEST_ALPHA() test
ACPI / PM: Fix header of acpi_dev_pm_detach() in acpi.h
ACPI / video: ignore BIOS initial backlight value for HP Folio 13-2000
ACPI : do not use Lid and Sleep button for S5 wakeup
ACPI / PNP: Do not crash due to stale pointer use during system resume
ACPI / video: Add "Asus UL30VT" to ACPI video detect blacklist
ACPI: do acpisleep dmi check when CONFIG_ACPI_SLEEP is set
spi / ACPI: add ACPI enumeration support
gpio / ACPI: add ACPI support
PM / devfreq: remove compiler error with module governors (2)
cpupower: IvyBridge (0x3a and 0x3e models) support
cpupower: Provide -c param for cpupower monitor to schedule process on all cores
cpupower tools: Fix warning and a bug with the cpu package count
cpupower tools: Fix malloc of cpu_info structure
cpupower tools: Fix issues with sysfs_topology_read_file
cpupower tools: Fix minor warnings
cpupower tools: Update .gitignore for files created in the debug directories
...
Bug fixes, little cleanups, and documentation changes. The most invasive
thing here touches a bunch of the arch directories to use a common build
rule for .dtb files. There are no major changes to functionality here
other than a ew new helper functions.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJQx2/sAAoJEEFnBt12D9kB8aYP/iInsIXDi6c0sGRNgYoFBZUH
pGZH/PGcZjDD2WPUOos56KTDsWt2kIZm/4ck1mBmXMD5pKM4znIQyMTkhKFiyO0l
CARwCjyRuaHQJHt88Pcaqhk34HGLgV7ntImMHwpIn0mNbjx7hruk9aU9Jqdcd32j
2ANbUYU9ikgq0e9s/+xQfB6QZxH1zecQkMalhQWvihtVybpG3xW67IROZv/69uL0
OtHZZJ7nCwXcEb84rcEHU781tO0CoQhr/pId7DirQEKaD8oNLHsejFEgGEFfFrvZ
eFmJP/YmZh7Ll+NNTMHnQwyDNa2LFuLazbjaqCqUHQgcw9daFFeP5ga3Uqp7WZbU
4kIxBQJ3byELnttFVQbsMQag+IAgmgfp8YdJFk3VTJDJKwpMJAO1sQeoEUHnnWAr
cyyCzROaFt1hUkhRiXso+IYNLvb60o0/2NJRTiObPdljy9OSXW6O3wFEGk9F/6u0
jgl9tisfLUL0orKnJ5tR3kbHaJKQd+6HgBRJNiuB+9FYO43j1cY/sYggSNCkiCMy
RvlGYDCJ+lfmZdZ4T+QYLpjsOenFosV0JZGU6MVlZmEGzTtLyhALZQKNHuxLw1Nc
oTe8Fx5aJrWEYe/HGqm4C43DqEiQUCmZ9NPae0JyfocmjuwARSnepvWe1XZNa/8t
aNfK6UkTO/IL3zOtOHfd
=LB3n
-----END PGP SIGNATURE-----
Merge tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux-2.6
Pull device tree changes from Grant Likely:
"Here are the DT changes I've got queued up for v3.8. As described
below, there are a lot of bug fixes here and documentation updates but
nothing major:
Bug fixes, little cleanups, and documentation changes. The most
invasive thing here touches a bunch of the arch directories to use a
common build rule for .dtb files. There are no major changes to
functionality here other than a few new helper functions."
* tag 'devicetree-for-linus' of git://git.secretlab.ca/git/linux-2.6: (34 commits)
arm64: Fix the dtbs target building
mtd: nand: davinci: fix the binding documentation
rtc: rtc-mv: Add the device tree binding documentation
devicetree/bindings: Move gpio-leds binding into leds directory
of/vendor-prefixes: add Imagination Technologies
microblaze: use new common dtc rule
c6x: use new common dtc rule
openrisc: use new common dtc rule
arm64: Add dtbs target for building all the enabled dtb files
arm64: use new common dtc rule
ARM: dt: change .dtb build rules to build in dts directory
kbuild: centralize .dts->.dtb rule
Fix build when CONFIG_W1_MASTER_GPIO=m b exporting "allnodes"
of/spi: Honour "status=disabled" property of device
of_mdio: Honour "status=disabled" property of device
of_i2c: Honour "status=disabled" property of device
powerpc: Fix fallout from device_node->name constification
of: add 'const' for of_parse_phandle parameter *np
Documentation: correct of_platform_populate() argument list
script: dtc: clean generated files
...
* pm-cpuidle:
cpuidle: Measure idle state durations with monotonic clock
cpuidle: fix a suspicious RCU usage in menu governor
cpuidle: support multiple drivers
cpuidle: prepare the cpuidle core to handle multiple drivers
cpuidle: move driver checking within the lock section
cpuidle: move driver's refcount to cpuidle
cpuidle: fixup device.h header in cpuidle.h
cpuidle / sysfs: move structure declaration into the sysfs.c file
cpuidle: Get typical recent sleep interval
cpuidle: Set residency to 0 if target Cstate not enter
cpuidle: Quickly notice prediction failure in general case
cpuidle: Quickly notice prediction failure for repeat mode
cpuidle / sysfs: move kobj initialization in the syfs file
cpuidle / sysfs: change function parameter
Many cpuidle drivers measure their time spent in an idle state by
reading the wallclock time before and after idling and calculating the
difference. This leads to erroneous results when the wallclock time gets
updated by another processor in the meantime, adding that clock
adjustment to the idle state's time counter.
If the clock adjustment was negative, the result is even worse due to an
erroneous cast from int to unsigned long long of the last_residency
variable. The negative 32 bit integer will zero-extend and result in a
forward time jump of roughly four billion milliseconds or 1.3 hours on
the idle state residency counter.
This patch changes all affected cpuidle drivers to either use the
monotonic clock for their measurements or make use of the generic time
measurement wrapper in cpuidle.c, which was already working correctly.
Some superfluous CLIs/STIs in the ACPI code are removed (interrupts
should always already be disabled before entering the idle function, and
not get reenabled until the generic wrapper has performed its second
measurement). It also removes the erroneous cast, making sure that
negative residency values are applied correctly even though they should
not appear anymore.
Signed-off-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Tested-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Merge my own merge branch to get various fixes from there
and upstream, especially the hvc console tty refcouting fixes
which which testing is quite a bit harder...
While the EEH does recovery on the specific PE that has PCI errors,
the PCI devices belonging to the PE will be removed and the PE will
be marked as invalid since we still need the information stored in
the PE. We only invalidate the PE when it doesn't have associated
EEH devices and valid child PEs. However, the code used to check
that is wrong. The patch fixes that.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The new EEH code introduced a small regression, if the EEH PEs
are missin (which happens currently in qemu for example), it
will deref a NULL pointer in the MSI code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Neither of these should ever be changed once set. Make them const and
fix up the users that try to modify it in-place. In one case
kmalloc+memcpy is replaced with kstrdup() to avoid modifying the string.
Build tested with defconfigs on ARM, PowerPC, Sparc, MIPS, x86 among
others.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Julian Calaby <julian.calaby@gmail.com>
Since we don't know if they new kernel we are kexecing into has been
built to support relocation on exceptions, we disable them before we
kexec.
We do NOT disable them if we are execing a kdump kernel, because we
want to change as little state as possible and it is likely that we are
execing ourselves and will be able to handle them anyway.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We currently do this synchronously at boot from setup_arch. On a large
system this could hypothetically take a little while to complete, so
currently we will give up if we are asked to wait for more than a second
in total.
If we actually start hitting that timeout in practice we can always move
this code into a kernel thread to take care of it in the background.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These wrappers hide the parameters that have to be passed to H_SET_MODE
to enable/disable relocation on during exceptions.
As noted in the comments, since these have partition wide scope, they
may take some time to complete and must be periodically retried until
H_SUCCESS is returned.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This new hcall in POWER8 is used to set various resource mode registers.
eg. it can set address translation mode on interrupt (note: partition wide
scope)
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Remove the pSeries_reconfig.h header file. At this point there is only one
definition in the file, pSeries_coalesce_init(), which can be
moved to rtas.h.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rename the prom_*_property routines of the generic OF code to of_*_property.
This brings them in line with the naming used by the rest of the OF code.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Geoff Levand <geoff@infradead.org>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch moves the notification chain for updates to the device tree
from the powerpc/pseries code to the base OF code. This makes this
functionality available to all architectures.
Additionally the notification chain is updated to allow notifications
for property add/remove/update. To make this work a pointer to a new
struct (of_prop_reconfig) is passed to the routines in the notification chain.
The of_prop_reconfig property contains a pointer to the node containing the
property and a pointer to the property itself. In the case of property
updates, the property pointer refers to the new property.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When adding or removing a device tree node we should also update
the device tree in /proc/device-tree. This action is already done in the
generic OF code for adding/removing properties of a node. This patch adds
this functionality for nodes.
Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Earlier without cpuidle framework on pseries, the native arch
idle routine comprised of both snooze and nap
states. smt_snooze_delay variable was used to delay
the idle process entry to deeper idle state like nap.
With the coming of cpuidle, this arch specific idle was replaced
by two different idle routines, one for supporting snooze and other
for nap. This enabled addition of more
low level idle states on pseries in the future.
On adopting the generic cpuidle framework for POWER systems,
the decision of which idle state to choose from, given a predicted
idle time is taken by the menu governor based on
target_residency and exit_latency of the idle states.
target_residency is the minimum time to be resident in that idle state.
Exit_latency is time taken to exit out of idle state.
Deeper the idle state, both the target residency and exit latency
would be higher.
In the current design, smt_snooze_delay is used as target_residency
for the snooze state which is incorrect, as it is not the
minimum but the maximum duration to be in snooze state.
This would result in the governor in taking bad decision,
as presently target_residency of nap < target_residency of snooze
inspite of nap being deeper idle state.
This patch aims to fix this problem by replacing the smt_snooze_delay loop
in snooze state, with the need_resched() as the governor is aware of
entry and exit of various idle transitions based on which
next idle time prediction.
The governor is intelligent enough to determine the idle state the needs to
be transitioned to and maintains a whole of heuristics including
io load, previous idle states predictions etc for the same, based on
which idle state entry decision is taken.
With this fix, of setting target_residency of snooze to 0
nap to smt_snooze_delay
if the predicted idle time is less
than smt_snooze_delay (target_residency of nap)
value governor would pick snooze state, else nap. This adhers to the
previous native idle design.
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
smt_snooze_delay was designed to delay idle loop's nap entry
in the native idle code before it got ported over to use as part of
the cpuidle framework.
A -ve value assigned to smt_snooze_delay should result in
busy looping, in other words disabling the entry to nap state.
- https://lists.ozlabs.org/pipermail/linuxppc-dev/2010-May/082450.html
This particular functionality can be achieved currently by
echo 1 > /sys/devices/system/cpu/cpu*/state1/disable
but it is broken when one assigns -ve value to the smt_snooze_delay
variable either via sysfs entry or ppc64_cpu util.
This patch aims to fix this, by disabling nap state when smt_snooze_delay
variable is set to -ve value.
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Remove the redundant target residency initialisation in pseries_cpuidle_driver_init().
This is currently over-writing the residency time updated as part of the static
table, resulting in all the idle states having the same target
residency of 100us which is incorrect. This may result in the menu governor making
wrong state decisions.
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull third pile of kernel_execve() patches from Al Viro:
"The last bits of infrastructure for kernel_thread() et.al., with
alpha/arm/x86 use of those. Plus sanitizing the asm glue and
do_notify_resume() on alpha, fixing the "disabled irq while running
task_work stuff" breakage there.
At that point the rest of kernel_thread/kernel_execve/sys_execve work
can be done independently for different architectures. The only
pending bits that do depend on having all architectures converted are
restrictred to fs/* and kernel/* - that'll obviously have to wait for
the next cycle.
I thought we'd have to wait for all of them done before we start
eliminating the longjump-style insanity in kernel_execve(), but it
turned out there's a very simple way to do that without flagday-style
changes."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
alpha: switch to saner kernel_execve() semantics
arm: switch to saner kernel_execve() semantics
x86, um: convert to saner kernel_execve() semantics
infrastructure for saner ret_from_kernel_thread semantics
make sure that kernel_thread() callbacks call do_exit() themselves
make sure that we always have a return path from kernel_execve()
ppc: eeh_event should just use kthread_run()
don't bother with kernel_thread/kernel_execve for launching linuxrc
alpha: get rid of switch_stack argument of do_work_pending()
alpha: don't bother passing switch_stack separately from regs
alpha: take SIGPENDING/NOTIFY_RESUME loop into signal.c
alpha: simplify TIF_NEED_RESCHED handling
Followups to d760afd4d2 ("memory-hotplug: suppress "Trying to free
nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning").
- use unsigned long type, as overflows are conceivable
- rename `i' to the less-misleading and more informative `section'
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When our x86 box calls __remove_pages(), release_mem_region() shows many
warnings. And x86 box cannot unregister iomem_resource.
"Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"
release_mem_region() has been changed to be called in each
PAGES_PER_SECTION by commit de7f0cba96 ("memory hotplug: release
memory regions in PAGES_PER_SECTION chunks"). Because powerpc registers
iomem_resource in each PAGES_PER_SECTION chunk. But when I hot add
memory on x86 box, iomem_resource is register in each _CRS not
PAGES_PER_SECTION chunk. So x86 box unregisters iomem_resource.
The patch fixes the problem.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kernel crash was reported by Alexy. He was testing some feature
with private kernel, in which Alexy added some code in pci_pm_reset()
to read the CSR after writting it. The bug could be reproduced on
Fiber Channel card (Fibre Channel: Emulex Corporation Saturn-X:
LightPulse Fibre Channel Host Adapter (rev 03)) by the following
commands.
# echo 1 > /sys/devices/pci0004:01/0004:01:00.0/reset
# rmmod lpfc
# modprobe lpfc
The history behind the test case is that those additional config
space reading operations in pci_pm_reset() would cause EEH error,
but we didn't detect EEH error until "modprobe lpfc". For the case,
all the PCI devices on PCI bus (0004:01) were removed and added after
PE reset. Then the EEH devices would be figured out again based on
the OF nodes. Unfortunately, there were some child OF nodes under
PCI device (0004:01:00.0), but they didn't have attached PCI_DN since
they're invisible from PCI domain. However, we were still trying to
convert OF node to EEH device without checking on the attached PCI_DN.
Eventually, it caused the kernel crash as follows:
Unable to handle kernel paging request for data at address 0x00000030
Faulting instruction address: 0xc00000000004d888
cpu 0x0: Vector: 300 (Data Access) at [c000000fc797b950]
pc: c00000000004d888: .eeh_add_device_tree_early+0x78/0x140
lr: c00000000004d880: .eeh_add_device_tree_early+0x70/0x140
sp: c000000fc797bbd0
msr: 8000000000009032
dar: 30
dsisr: 40000000
current = 0xc000000fc78d9f70
paca = 0xc00000000edb0000 softe: 0 irq_happened: 0x00
pid = 2951, comm = eehd
enter ? for help
[c000000fc797bc50] c00000000004d848 .eeh_add_device_tree_early+0x38/0x140
[c000000fc797bcd0] c00000000004d848 .eeh_add_device_tree_early+0x38/0x140
[c000000fc797bd50] c000000000051b54 .pcibios_add_pci_devices+0x34/0x190
[c000000fc797bde0] c00000000004fb10 .eeh_reset_device+0x100/0x160
[c000000fc797be70] c0000000000502dc .eeh_handle_event+0x19c/0x300
[c000000fc797bf00] c000000000050570 .eeh_event_handler+0x130/0x1a0
[c000000fc797bf90] c000000000020138 .kernel_thread+0x54/0x70
The patch changes of_node_to_eeh_dev() and just returns NULL if the
passed OF node doesn't have attached PCI_DN.
Cc: stable@vger.kernel.org
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The EEH core is talking with the PCI device driver to determine the
action (purely reset, or PCI device removal). During the period, the
driver might be unloaded and in turn causes kernel crash as follows:
EEH: Detected PCI bus error on PHB#4-PE#10000
EEH: This PCI device has failed 3 times in the last hour
lpfc 0004:01:00.0: 0:2710 PCI channel disable preparing for reset
Unable to handle kernel paging request for data at address 0x00000490
Faulting instruction address: 0xd00000000e682c90
cpu 0x1: Vector: 300 (Data Access) at [c000000fc75ffa20]
pc: d00000000e682c90: .lpfc_io_error_detected+0x30/0x240 [lpfc]
lr: d00000000e682c8c: .lpfc_io_error_detected+0x2c/0x240 [lpfc]
sp: c000000fc75ffca0
msr: 8000000000009032
dar: 490
dsisr: 40000000
current = 0xc000000fc79b88b0
paca = 0xc00000000edb0380 softe: 0 irq_happened: 0x00
pid = 3386, comm = eehd
enter ? for help
[c000000fc75ffca0] c000000fc75ffd30 (unreliable)
[c000000fc75ffd30] c00000000004fd3c .eeh_report_error+0x7c/0xf0
[c000000fc75ffdc0] c00000000004ee00 .eeh_pe_dev_traverse+0xa0/0x180
[c000000fc75ffe70] c00000000004ffd8 .eeh_handle_event+0x68/0x300
[c000000fc75fff00] c0000000000503a0 .eeh_event_handler+0x130/0x1a0
[c000000fc75fff90] c000000000020138 .kernel_thread+0x54/0x70
1:mon>
The patch increases the reference of the corresponding driver modules
while EEH core does the negotiation with PCI device driver so that the
corresponding driver modules can't be unloaded during the period and
we're safe to refer the callbacks.
Cc: stable@vger.kernel.org
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have missed lots of situations where the PE hierarchy tree need
protection through the EEH global mutex. The patch fixes that for
those public APIs implemented in eeh_pe.c. The only exception is
eeh_pe_restore_bars() because it calls eeh_pe_dev_traverse(), which
has been protected by the mutex.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Function eeh_rmv_from_parent_pe() could be called by the path of
either normal PCI hotplug, or EEH recovery. For the former case,
we need purge the corresponding PE on removal of the associated
PE bus.
The patch tries to cover that by passing more information to function
pcibios_remove_pci_devices() so that we know if the corresponding PE
needs to be purged or be marked as "invalid".
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When EEH error happens on the PE whose PCI devices don't have
attached drivers. In function eeh_handle_event(), the default
value PCI_ERS_RESULT_NONE will be returned after iterating all
drivers of those PCI devices belonging to the PE. Actually, we
don't have installed drivers for the PCI devices. Under the
circumstance, we will remove the corresponding PCI bus of the PE,
including the associated EEH devices and PE instance. However,
we still need the information stored in the PE instance to do PE
reset after that. So it's unsafe to free the PE instance.
The patch introduces EEH_PE_INVALID type PE to address the issue.
When the PCI bus and the corresponding attached EEH devices are
removed, we will mark the PE as EEH_PE_INVALID. At later point,
the PE will be changed to EEH_PE_DEVICE or EEH_PE_BUS when the
corresponding EEH devices are attached again.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This patch convert different functions to take virtual page number
instead of virtual address. Virtual page number is virtual address
shifted right by VPN_SHIFT (12) bits. This enable us to have an
address range of upto 76 bits.
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently we mark the DABRX to interrupt on all matches
(hypervisor/kernel/user and then filter in software. We can be a lot
smarter now that we can set the DABRX dynamically.
This sets the DABRX based on the flags passed by the user.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Rework set_dabr to take a DABRX value as well.
Both the pseries and PS3 hypervisors do some checks on the DABRX
values that are passed in the hcall. This patch stops bogus values
from being passed to hypervisor. Also, in the case where we are
clearing the breakpoint, where DABR and DABRX are zero, we modify the
DABRX value to make it valid so that the hcall won't fail.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch does cleanup on EEH PCI address cache based on the fact
EEH core is the only user of the component.
* Cleanup on function names so that they all have prefix
"eeh" and looks more short.
* Function printk() has been replaced with pr_debug() or
pr_warning() accordingly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The idea comes from Benjamin Herrenschmidt. The eeh cache helps
fetching the pci device according to the given I/O address. Since
the eeh cache is serving for eeh, it's reasonable for eeh cache
to trace eeh device except pci device.
The patch make eeh cache to trace eeh device. Also, the major
eeh entry function eeh_dn_check_failure has been renamed to
eeh_dev_check_failure since it will take eeh device as input
parameter.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While EEH module is installed, PCI devices is checked one by one
to see if it supports eeh. On different platforms, the PCI devices
are referred through different ways when the EEH module is loaded.
For example, on pSeries platform, that is done by OF node. However,
we would do that by real PCI devices (struct pci_dev) on PowerNV
platform in future. So we needs some mechanism to differentiate
those cases by classifying them to probe modes, either from OF
nodes or real PCI devices.
The patch implements the support to eeh probe mode. Also, the
EEH on pSeries has set it into EEH_PROBE_MODE_DEVTREE. That means
the probe will be done based on OF nodes on pSeries platform.
In addition, On pSeries platform, it's done by OF nodes. The patch
moves the the probe function from EEH core to platform dependent
backend and some cleanup applied.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch removes the eeh related statistics for eeh device since
they have been maintained by the corresponding eeh PE. Also, the
flags used to trace the state of eeh device and PE have been reworked
for a little bit.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch reworks the current implementation so that the eeh errors
will be handled basing on PE instead of eeh device.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Once eeh error is found, eeh event will be created and put it into
the global linked list. At the mean while, kernel thread will be
started to process it. The handler for the kernel thread originally
was eeh device sensitive.
The patch reworks the handler of the kernel thread so that it's PE
sensitive.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch implements reset based on PE instead of eeh device. Also,
The functions used to retrieve the reset type, either hot or fundamental
reset, have been reworked for a little bit. More specificly, it's
implemented based the the eeh device traverse function.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch refactors the original implementation in order to enable
I/O and retrieve EEH log based on PE.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch introduces the function to traverse the devices of the
specified PE and its child PEs. Also, the restore on device bars
is implemented based on the traverse function.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Originally, all the EEH operations were implemented based on OF node.
Actually, it explicitly breaks the rules that the operation target
is PE instead of device. Therefore, the patch makes all the operations
based on PE instead of device.
Unfortunately, the backend for config space has to be kept as original
because it doesn't depend on PE.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There're 2 conditions to trigger EEH error detection: invalid value
returned from reading I/O or config space. On each case, the function
eeh_dn_check_failure will be called to initialize EEH event and put
it into the poll for further processing.
The patch changes the function for a little bit so that the EEH error
will be traced based on PE instead of EEH device any more. Also, the
function eeh_find_device_pe() has been removed since the eeh device
is tracing the PE by struct eeh_dev::pe.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since we've introduced dedicated struct to trace individual PEs,
it's reasonable to trace its state through the dedicated struct
instead of using "eeh_dev" any more.
The patches implements the state tracing based on PE. It's notable
that the PE state will be applied to the specified PE as well as
its child PEs. That complies with the rule that problematic parent
PE will prevent those child PEs from working properly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The original implementation builds EEH event based on EEH device.
We already had dedicated struct to depict PE. It's reasonable to
build EEH event based on PE.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
During PCI hotplug and EEH recovery, the PE hierarchy tree might be
changed due to the PCI topology changes. At later point when the
PCI device is added, the PE will be created dynamically again.
The patch introduces new function to remove EEH devices from the
associated PE. That also can cause that the parent PE is removed
from the PE tree if the parent PE doesn't include valid EEH devices
and child PEs.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch creates PEs and associated the newly created PEs with
it parent/silbing as well as EEH devices. It would become more
straight to trace EEH errors and recover them accordingly.
Once the EEH functionality on one PCI IOA has been enabled, we
tries to create PE against it. If there's existing PE, to which
the current PCI IOA should be attached, the existing PE will be
converted from "device" type to "bus" type accordingly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch implements searching PE based on the following
requirements:
* Search PE according to PE address, which is traditional
PE address that is composed of PCI bus/device/function
number, or unified PE address assigned by firmware or
platform.
* Search parent PE according to the given EEH device. It's
useful when creating new PE and put it into right position.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
For one particular PE, it's only meaningful in the ancestor PHB
domain. Therefore, each PHB should have its own PE hierarchy tree
to trace those PEs created against the PHB.
The patch creates PEs for the PHBs and put those PEs into the
global link list traced by "eeh_phb_pe". The link list of PEs
would be first level of overall PE hierarchy tree across the
system.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch introduces global mutex for EEH so that the core data
structures can be protected by that. Also, 2 inline functions
are exported for that: eeh_lock() and eeh_unlock().
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch adds more logs to EEH initialization functions for
debugging purpose. Also, the machine type (pSeries) is checked
in the platform initialization to assure it's the correct platform
to invoke it.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The EEH initialization functions have been postponed until slab/slub
are ready. So we use slab/slub to allocate the memory chunks for newly
creatd EEH devices. That would save lots of memory.
The patch also does cleanup to replace "kmalloc" with "kzalloc" so
that we needn't clear the allocated memory chunk explicitly.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, we have 3 phases for EEH initialization on pSeries platform.
All of them are done through builtin functions: platform initialization,
EEH device creation, and EEH subsystem enablement. All of them are done
no later than ppc_md.setup_arch. That means that the slab/slub isn't ready
yet, so we have to allocate memory chunks on basis of PAGE_SIZE for those
dynamically created EEH devices. That's pretty expensive.
In order to utilize slab/slub for memory allocation, we have to move the EEH
initialization functions around, but all of them should be called after slab
is ready.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We never use the XDABR hcall since we check for DABR hcall first.
XDABR syscall is better since it allows us to also set the DABRX.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The pseries firmware currently refuses any non power of two MSI-X
request. Unfortunately most network drivers end up asking for that
because they want a power of two for RX queues and one or two extra
for everything else.
This patch rounds up the firmware request to the next power of two
if the quota allows it. If this fails we fall back to using the
original request size.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It's empty now, apart from other includes.
Fixup a few files that were getting things via this header.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
These days they are just __va() and __pa() respectively.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Host bridge hotplug
- Add MMCONFIG support for hot-added host bridges (Jiang Liu)
Device hotplug
- Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
- Call FINAL fixups for hot-added devices, too (Myron Stowe)
- Factor out generic code for P2P bridge hot-add (Yinghai Lu)
- Remove all functions in a slot, not just those with _EJx (Amos Kong)
Dynamic resource management
- Track bus number allocation (struct resource tree per domain) (Yinghai Lu)
- Make P2P bridge 1K I/O windows work with resource reassignment (Bjorn Helgaas, Yinghai Lu)
- Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
Power management
- Add PCIe runtime D3cold support (Huang Ying)
Virtualization
- Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex Williamson)
- Add quirks for devices with broken INTx masking (Jan Kiszka)
Miscellaneous
- Fix some PCI Express capability version issues (Myron Stowe)
- Factor out some arch code with a weak, generic, pcibios_setup() (Myron Stowe)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJQBy+9AAoJEPGMOI97Hn6zOpQP+wVFvA7pcteFj6HPs5nTq2Hc
55oeRqCO0wBHoFMCKB0AjeTATjqxi9OhcjaiVrZejxNyWKC9MnrXuunpQ0l/hCbR
M/TK+BCelfX2FU4eXNf+TBCCcOhOVWqQft9Gm6nYKwX8Y0msRVCceI4WwhZgSwtI
vdtmnqlwolscdnq+8ThsnvUMtwkN0gExmn2FJRl6EoEgG0DTqhMkZ83uA+NPBhvv
I+g0XbA6haaZph2nnSYR0hIW4Q7JkT/LgA6uVAQxamctwxLol7xxsjCRnfqrulkf
kaRr2fAgBXfmaOIltro4UkXrCM52ZSyggCDfExHp6mWGPKMjE5ZcyK1YbGfmmumk
DS3t1S0eBdDJXrnf9l/Yb8e95dQxRCYKelKzr1rTD9QAXsInE8rC40hvhfFaTa4s
nZYRTz0SKv6coQihqaOR7shx1DNomLFk7jndaWEElfl9/cT/nQnZ8XLfVMzkJNNB
Y4SM6zkiIaCL0aiSEE16MqVjmODYRjbURLYzQIrqr2KJQg8X6XjIRojQLjL6xEgA
22ry2ZRPhqO68g7aLqvixiSDaTp0Z0Vw+JmgjtBqvkokwZcGQtm4umkpAdOi+Es8
3bJaMY7ZUpDX53FE8iyP6AnmR/1k19rC1gNnNq/syWyjtYOYJ9i3QCTafFgvE1VC
5coQ1L5tByHvpzK5PHwf
=oo/A
-----END PGP SIGNATURE-----
Merge tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
"Host bridge hotplug:
- Add MMCONFIG support for hot-added host bridges (Jiang Liu)
Device hotplug:
- Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
- Call FINAL fixups for hot-added devices, too (Myron Stowe)
- Factor out generic code for P2P bridge hot-add (Yinghai Lu)
- Remove all functions in a slot, not just those with _EJx (Amos
Kong)
Dynamic resource management:
- Track bus number allocation (struct resource tree per domain)
(Yinghai Lu)
- Make P2P bridge 1K I/O windows work with resource reassignment
(Bjorn Helgaas, Yinghai Lu)
- Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
Power management:
- Add PCIe runtime D3cold support (Huang Ying)
Virtualization:
- Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex
Williamson)
- Add quirks for devices with broken INTx masking (Jan Kiszka)
Miscellaneous:
- Fix some PCI Express capability version issues (Myron Stowe)
- Factor out some arch code with a weak, generic, pcibios_setup()
(Myron Stowe)"
* tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (122 commits)
PCI: hotplug: ensure a consistent return value in error case
PCI: fix undefined reference to 'pci_fixup_final_inited'
PCI: build resource code for M68K architecture
PCI: pciehp: remove unused pciehp_get_max_lnk_width(), pciehp_get_cur_lnk_width()
PCI: reorder __pci_assign_resource() (no change)
PCI: fix truncation of resource size to 32 bits
PCI: acpiphp: merge acpiphp_debug and debug
PCI: acpiphp: remove unused res_lock
sparc/PCI: replace pci_cfg_fake_ranges() with pci_read_bridge_bases()
PCI: call final fixups hot-added devices
PCI: move final fixups from __init to __devinit
x86/PCI: move final fixups from __init to __devinit
MIPS/PCI: move final fixups from __init to __devinit
PCI: support sizing P2P bridge I/O windows with 1K granularity
PCI: reimplement P2P bridge 1K I/O windows (Intel P64H2)
PCI: disable MEM decoding while updating 64-bit MEM BARs
PCI: leave MEM and IO decoding disabled during 64-bit BAR sizing, too
PCI: never discard enable/suspend/resume_early/resume fixups
PCI: release temporary reference in __nv_msi_ht_cap_quirk()
PCI: restructure 'pci_do_fixups()'
...
A small set of changes for devicetree:
- Couple of Documentation fixes
- Addition of new helper function of_node_full_name
- Improve of_parse_phandle_with_args return values
- Some NULL related sparse fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQEcBAABAgAGBQJQDwsgAAoJEMhvYp4jgsXiuwUH/Ri6ZSnqHcz4Wa/X4FxvNc3I
3Xelo/Vt3WLYue3s/+OYiM5FK9+KH8T6x+U79Q4p7vePcfUh6GJII0AUbMeRghkS
m3FjNd5syzYNJlnDnqdngQYRDpaz8U/SyftjXyMPjJ1VWiyLx/EJQUkj1EEwDLe/
ZVabppnco3Y6OJpFuETONNvXx5mE7xq86isW5+aYmviMkWSMMwJPf8qofLJ78Dh5
OAhWuCPRDooz548+Wkabt90qHjF6FU43w5fU7zZW26NT39ptppcbZ2bAXcTYqIIq
sATp5YSitvwFqO2c1mA/drZ9nrgxDPCaw3qCDyiMdcbWgXqDirz2x7q1iauVHF4=
=5TZ/
-----END PGP SIGNATURE-----
Merge tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux
Pull devicetree updates from Rob Herring:
"A small set of changes for devicetree:
- Couple of Documentation fixes
- Addition of new helper function of_node_full_name
- Improve of_parse_phandle_with_args return values
- Some NULL related sparse fixes"
Grant's busy packing.
* tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux:
of: mtd: nuke useless const qualifier
devicetree: add helper inline for retrieving a node's full name
of: return -ENOENT when no property
usage-model.txt: fix typo machine_init->init_machine
of: Fix null pointer related warnings in base.c file
LED: Fix missing semicolon in OF documentation
of: fix a few typos in the binding documentation
Function eeh_event_handler() dereferences the pointer returned by
handle_eeh_events() without checking, causing a crash if NULL was
returned, which is expected in some situations.
This patch fixes this bug by checking for the value returned by
handle_eeh_events() before dereferencing it.
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: stable@vger.kernel.org [v3.4+]
prom_update_property() currently fails if the property doesn't
actually exist yet which isn't what we want. Change to add-or-update
instead of update-only, then we can remove a lot duplicated lines.
Suggested-by: Grant Likely <grant.likely@secretlab.ca>
Signed-off-by: Dong Aisheng <dong.aisheng@linaro.org>
Acked-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently the call to pseries_notify_cpuidle_add_cpu(), that takes
action on the cpuidle front when a cpu is added/removed
is being made from smp_xics_setup_cpu().
This caused lockdep issues as
reported https://lkml.org/lkml/2012/5/17/2
On addition of each cpu,
resources were cleared and re-allocated each time, all in critical
section as part of start_secondary() call were interrupts are disabled.
To resolve this issue, the pseries_notify_cpuidle_add_cpu() call is
is being replaced by a hotplug notifier which
would prevent cpuidle resources from being
released and allocated each time cpu is onlined in the critical code path.
It was fixed in https://lkml.org/lkml/2012/5/18/174.
Also it is essential to call cpuidle_enable/disable_device
between cpuidle_pause_and_lock() and
cpuidle_resume_and_unlock() when used externally
to avoid race conditions. Add support for CPU_ONLINE_FROZEN
and CPU_DEAD_FROZEN as part of hotplug notify event for
pseries_idle and unregister hotplug notifier
while exiting out. The above mentioned issues
are fixed as part of this patch.
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Merge the defines of STACKFRAMESIZE, STK_REG, STK_PARAM from different
places.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anything that uses a constructed instruction (ie. from ppc-opcode.h),
need to use the new R0 macro, as %r0 is not going to work.
Also convert usages of macros where we are just determining an offset
(usually for a load/store), like:
std r14,STK_REG(r14)(r1)
Can't use STK_REG(r14) as %r14 doesn't work in the STK_REG macro since
it's just calculating an offset.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Looks like we still have issues with pSeries and Cell idle code
vs. the lazy irq state. In fact, the reset fixes that went upstream
are exposing the problem more by causing BUG_ON() to trigger (which
this patch turns into a WARN_ON instead).
We need to be careful when using a variant of low power state that
has the side effect of turning interrupts back on, to properly set
all the SW & lazy state to look as if everything is enabled before
we enter the low power state with MSR:EE off as we will return with
MSR:EE on. If not, we have a discrepancy of state which can cause
things to go very wrong later on.
This patch moves the logic into a helper and uses it from the
pseries and cell idle code. The power4/970 idle code already got
things right (in assembly even !) so I'm not touching it. The power7
"bare metal" idle code is subtly different and correct. Remains PA6T
and some hypervisor based Cell platforms which have questionable
code in there, but they are mostly dead platforms so I'll fix them
when I manage to get final answers from the respective maintainers
about how the low power state actually works on them.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: stable@vger.kernel.org [v3.4]
The pattern (np ? np->full_name : "<none>") is rather common in the
kernel, but can also make for quite long lines. This patch adds a new
inline function, of_node_full_name() so that the test for a valid node
pointer doesn't need to be open coded at all call sites.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
tce_buildmulti_pSeriesLP uses a per cpu page to communicate with the
hypervisor. We currently rely on the IOMMU table spinlock but
subsequent patches will be removing that so disable interrupts
around all accesses of tce_page.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The following patch is to remove the pseries_notify_add_cpu() call
and replace it by a hot plug notifier.
This would prevent cpuidle resources being released and allocated each
time cpu comes online on pseries.
The earlier design was causing a lockdep problem
in start_secondary as reported on this thread
-https://lkml.org/lkml/2012/5/17/2
This applies on 3.4-rc7
Signed-off-by: Deepthi Dharwar <deepthi@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
An upcoming release of firmware will add DDW extensions, in particular
an API to "reset" the DMA window to the original configuration (32-bit,
2GB in size). With that API available, we can safely remove the default
window, increasing the resources available to firmware for creation of
larger windows for the slot in question -- if we encounter an error, we
can use the new API to reset the state of the slot.
Further, this same release of firmware will make it a hard requirement
for OSes to release the existing window before any other windows will be
shown as available, to avoid conflicts in addressing between the two
windows.
In anticipation of these changes, always remove the default window
before we do any DDW manipulations.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The following added support for powernv but broke pseries/BML:
1f1616e powerpc/powernv: Add TCE SW invalidation support
TCE_PCI_SW_INVAL was split into FREE and CREATE flags but the tests in
the pseries code were not updated to reflect this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@kernel.org [v3.3+]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit f948501b36 ("Make hard_irq_disable() actually hard-disable
interrupts") caused check_and_cede_processor to stop working.
->irq_happened will never be zero right after a hard_irq_disable
so the compiler removes the call to cede_processor completely.
The bug was introduced back in the lazy interrupt handling rework
of 3.4 but was hidden until recently because hard_irq_disable did
nothing.
This issue will eventually appear in 3.4 stable since the
hard_irq_disable fix is marked stable, so mark this one for stable
too.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Provide an iterator to receive the log buffer content, and convert all
kmsg_dump() users to it.
The structured data in the kmsg buffer now contains binary data, which
should no longer be copied verbatim to the kmsg_dump() users.
The iterator should provide reliable access to the buffer data, and also
supports proper log line-aware chunking of data while iterating.
Signed-off-by: Kay Sievers <kay@vrfy.org>
Tested-by: Tony Luck <tony.luck@intel.com>
Reported-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Tested-by: Anton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Replace the struct pci_bus secondary/subordinate members with the
struct resource busn_res. Later we'll build a resource tree of these
bus numbers.
[bhelgaas: changelog]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Pull powerpc updates from Benjamin Herrenschmidt:
"Here are the powerpc goodies for 3.5. Main highlights are:
- Support for the NX crypto engine in Power7+
- A bunch of Anton goodness, including some micro optimization of our
syscall entry on Power7
- I converted a pile of our thermal control drivers to the new i2c
APIs (essentially turning the old therm_pm72 into a proper set of
windfarm drivers). That's one more step toward removing the
deprecated i2c APIs, there's still a few drivers to fix, but we are
getting close
- kexec/kdump support for 47x embedded cores
The big missing thing here is no updates from Freescale. Not sure
what's up here, but with Kumar not working for them anymore things are
a bit in a state of flux in that area."
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (71 commits)
powerpc: Fix irq distribution
Revert "powerpc/hw-breakpoint: Use generic hw-breakpoint interfaces for new PPC ptrace flags"
powerpc: Fixing a cputhread code documentation
powerpc/crypto: Enable the PFO-based encryption device
powerpc/crypto: Build files for the nx device driver
powerpc/crypto: debugfs routines and docs for the nx device driver
powerpc/crypto: SHA512 hash routines for nx encryption
powerpc/crypto: SHA256 hash routines for nx encryption
powerpc/crypto: AES-XCBC mode routines for nx encryption
powerpc/crypto: AES-GCM mode routines for nx encryption
powerpc/crypto: AES-ECB mode routines for nx encryption
powerpc/crypto: AES-CTR mode routines for nx encryption
powerpc/crypto: AES-CCM mode routines for nx encryption
powerpc/crypto: AES-CBC mode routines for nx encryption
powerpc/crypto: nx driver code supporting nx encryption
powerpc/pseries: Enable the PFO-based RNG accelerator
powerpc/pseries/hwrng: PFO-based hwrng driver
powerpc/pseries: Add PFO support to the VIO bus
powerpc/pseries: Add pseries update notifier for OFDT prop changes
powerpc/pseries: Add new hvcall constants to support PFO
...
Pull trivial updates from Jiri Kosina:
"As usual, it's mostly typo fixes, redundant code elimination and some
documentation updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
edac, mips: don't change code that has been removed in edac/mips tree
xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
lib: Change mail address of Oskar Schirmer
net: Change mail address of Oskar Schirmer
arm/m68k: Change mail address of Sebastian Hess
i2c: Change mail address of Oskar Schirmer
net: Fix tcp_build_and_update_options comment in struct tcp_sock
atomic64_32.h: fix parameter naming mismatch
Kconfig: replace "--- help ---" with "---help---"
c2port: fix bogus Kconfig "default no"
edac: Fix spelling errors.
qla1280: Remove redundant NULL check before release_firmware() call
remoteproc: remove redundant NULL check before release_firmware()
qla2xxx: Remove redundant NULL check before release_firmware() call.
aic94xx: Get rid of redundant NULL check before release_firmware() call
tehuti: delete redundant NULL check before release_firmware()
qlogic: get rid of a redundant test for NULL before call to release_firmware()
bna: remove redundant NULL test before release_firmware()
tg3: remove redundant NULL test before release_firmware() call
typhoon: get rid of redundant conditional before all to release_firmware()
...
This adds an update notifier mechanism for changes to properties in the
device tree. One use of this would be a device driver that needs to act
on changes to it's properties in the device tree after a live migration
or a dynamic activation that is triggered by updates to ofdt properties.
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Kent Yoder <key@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When we get an EEH error we just print a backtrace with dump_stack
which is rather cryptic. We really should print something before
spewing out the backtrace.
Also switch from dump_stack to WARN so we get more information about
the fail - what modules were loaded, what process was running etc.
This was useful information when debugging a recent EEH subsystem bug.
The standard WARN output should also get picked up by monitoring
tools like kerneloops.
The register dump is of questionable value here but I figured it was
better to use something standard and not roll my own.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have a union containing fields from the old iseries hypervisor
that has been reused for the cede latency hint. Since we no
longer support iseries, remove the union completely.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Recently, Ryan Wang tried to compile PPC pSeries platform without
CONFIG_EEH and eventually run into errors. Nishanth Aravamudan
helped to narrow down the root cause. Actually, the pSeries platform
depends on CONFIG_EEH heavily and that won't work properly without
EEH support.
According to Ben's suggestion, the patch make CONFIG_EEH invisible
and keep it as always selected on pSeries platform.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The problem was reported by Anton Blanchard. While EEH error
happened to the PCI device without the corresponding device
driver, kernel crash was seen. Eventually, I successfully
reproduced the problem on Firebird-L machine with utility
"errinjct". Initially, the device driver for Emulex ethernet
MAC has been disabled from .config and force data parity on
the Emulex ethernet MAC with help of "errinjct". Eventually,
I saw the kernel crash after issueing couple of "lspci -v"
command.
The root cause behind is that the PCI device, including the
reference to the corresponding eeh device, will be removed
from the system while EEH does recovery. Afterwards, the
PCI device will be probed again and added into the system
accordingly. So it's not safe to retrieve the eeh device from
the corresponding PCI device after the PCI device has been removed
and not added again.
The patch fixes the issue and retrieve the eeh device from OF node
instead of PCI device after the PCI device has been removed.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
That set_current_state() won't work very well: the subsequent mutex_lock()
might flip the task back into TASK_RUNNING.
Attempt to put it somewhere where it might have been meant to be, and
attempt to describe why it might have been added.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
daemonize() is only needed when a user-space task does kernel_thread().
eeh_event_handler() thread is created by the worker kthread, and thus it
doesn't need the soon-to-be-deprecated daemonize().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Linas Vepstas <linasvepstas@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>