Pull ARM fixes from Russell King:
- Łukasz Stelmach spotted a couple of issues with the decompressor.
- a couple of kdump fixes found while testing kdump
- replace some perl with shell code
- resolve SIGFPE breakage
- kprobes fixes
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: fix kill( ,SIGFPE) breakage
ARM: 8772/1: kprobes: Prohibit kprobes on get_user functions
ARM: 8771/1: kprobes: Prohibit kprobes on do_undefinstr
ARM: 8770/1: kprobes: Prohibit probing on optimized_callback
ARM: 8769/1: kprobes: Fix to use get_kprobe_ctlblk after irq-disabed
ARM: replace unnecessary perl with sed and the shell $(( )) operator
ARM: kexec: record parent context registers for non-crash CPUs
ARM: kexec: fix kdump register saving on panic()
ARM: 8758/1: decompressor: restore r1 and r2 just before jumping to the kernel
ARM: 8753/1: decompressor: add a missing parameter to the addruart macro
Pull x86 fixes from Thomas Gleixner:
"An unfortunately larger set of fixes, but a large portion is
selftests:
- Fix the missing clusterid initializaiton for x2apic cluster
management which caused boot failures due to IPIs being sent to the
wrong cluster
- Drop TX_COMPAT when a 64bit executable is exec()'ed from a compat
task
- Wrap access to __supported_pte_mask in __startup_64() where clang
compile fails due to a non PC relative access being generated.
- Two fixes for 5 level paging fallout in the decompressor:
- Handle GOT correctly for paging_prepare() and
cleanup_trampoline()
- Fix the page table handling in cleanup_trampoline() to avoid
page table corruption.
- Stop special casing protection key 0 as this is inconsistent with
the manpage and also inconsistent with the allocation map handling.
- Override the protection key wen moving away from PROT_EXEC to
prevent inaccessible memory.
- Fix and update the protection key selftests to address breakage and
to cover the above issue
- Add a MOV SS self test"
[ Part of the x86 fixes were in the earlier core pull due to dependencies ]
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/mm: Drop TS_COMPAT on 64-bit exec() syscall
x86/apic/x2apic: Initialize cluster ID properly
x86/boot/compressed/64: Fix moving page table out of trampoline memory
x86/boot/compressed/64: Set up GOT for paging_prepare() and cleanup_trampoline()
x86/pkeys: Do not special case protection key 0
x86/pkeys/selftests: Add a test for pkey 0
x86/pkeys/selftests: Save off 'prot' for allocations
x86/pkeys/selftests: Fix pointer math
x86/pkeys: Override pkey when moving away from PROT_EXEC
x86/pkeys/selftests: Fix pkey exhaustion test off-by-one
x86/pkeys/selftests: Add PROT_EXEC test
x86/pkeys/selftests: Factor out "instruction page"
x86/pkeys/selftests: Allow faults on unknown keys
x86/pkeys/selftests: Avoid printf-in-signal deadlocks
x86/pkeys/selftests: Remove dead debugging code, fix dprint_in_signal
x86/pkeys/selftests: Stop using assert()
x86/pkeys/selftests: Give better unexpected fault error messages
x86/selftests: Add mov_to_ss test
x86/mpx/selftests: Adjust the self-test to fresh distros that export the MPX ABI
x86/pkeys/selftests: Adjust the self-test to fresh distros that export the pkeys ABI
...
Pull UP timer fix from Thomas Gleixner:
"Work around the for_each_cpu() oddity on UP kernels in the tick
broadcast code which causes boot failures because the CPU0 bit is
always reported as set independent of the cpumask content"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Use for_each_cpu() specially on UP kernels
Pull scheduler fixlets from Thomas Gleixner:
"Three trivial fixlets for the scheduler:
- move print_rt_rq() and print_dl_rq() declarations to the right
place
- make grub_reclaim() static
- fix the bogus documentation reference in Kconfig"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix documentation file path
sched/deadline: Make the grub_reclaim() function static
sched/debug: Move the print_rt_rq() and print_dl_rq() declarations to kernel/sched/sched.h
Pull RAS fix from Thomas Gleixner:
"Fix a regression in the new AMD SMCA code which issues an SMP function
call from the early interrupt disabled region of CPU hotplug. To avoid
that, use cached block addresses which can be used directly"
* 'ras-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/MCE/AMD: Cache SMCA MISC block addresses
Pull perf tooling fixes from Thomas Gleixner:
- fix segfault when processing unknown threads in cs-etm
- fix "perf test inet_pton" on s390 failing due to missing inline
- display all available events on 'perf annotate --stdio'
- add missing newline when parsing an empty BPF program
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Add missing newline when parsing empty BPF proggie
perf cs-etm: Remove redundant space
perf cs-etm: Support unknown_thread in cs_etm_auxtrace
perf annotate: Display all available events on --stdio
perf test: "probe libc's inet_pton" fails on s390 due to missing inline
Pull locking fixes from Thomas Gleixner:
"Two fixes to address shortcomings of the rwsem/percpu-rwsem lock
debugging code which emits false positive warnings when the rwsem is
anonymously locked and unlocked"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/percpu-rwsem: Annotate rwsem ownership transfer by setting RWSEM_OWNER_UNKNOWN
locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flag
Pull EFI fixes from Thomas Gleixner:
- Use explicitely sized type for the romimage pointer in the 32bit EFI
protocol struct so a 64bit kernel does not expand it to 64bit. Ditto
for the 64bit struct to avoid the reverse issue on 32bit kernels.
- Handle randomized tex offset correctly in the ARM64 EFI stub to avoid
unaligned data resulting in stack corruption and other hard to
diagnose wreckage.
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/libstub/arm64: Handle randomized TEXT_OFFSET
efi: Avoid potential crashes, fix the 'struct efi_pci_io_protocol_32' definition for mixed mode
Pull core fixes from Thomas Gleixner:
- Unbreak the BPF compilation which got broken by the unconditional
requirement of asm-goto, which is not supported by clang.
- Prevent probing on exception masking instructions in uprobes and
kprobes to avoid the issues of the delayed exceptions instead of
having an ugly workaround.
- Prevent a double free_page() in the error path of do_kexec_load()
- A set of objtool updates addressing various issues mostly related to
switch tables and the noreturn detection for recursive sibling calls
- Header sync for tools.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Detect RIP-relative switch table references, part 2
objtool: Detect RIP-relative switch table references
objtool: Support GCC 8 switch tables
objtool: Support GCC 8's cold subfunctions
objtool: Fix "noreturn" detection for recursive sibling calls
objtool, kprobes/x86: Sync the latest <asm/insn.h> header with tools/objtool/arch/x86/include/asm/insn.h
x86/cpufeature: Guard asm_volatile_goto usage for BPF compilation
uprobes/x86: Prohibit probing on MOV SS instruction
kprobes/x86: Prohibit probing on exception masking instructions
x86/kexec: Avoid double free_page() upon do_kexec_load() failure
Currently ip6gre and ip6erspan share single metadata mode device,
using 'collect_md_tun'. Thus, when doing:
ip link add dev ip6gre11 type ip6gretap external
ip link add dev ip6erspan12 type ip6erspan external
RTNETLINK answers: File exists
simply fails due to the 2nd tries to create the same collect_md_tun.
The patch fixes it by adding a separate collect md tunnel device
for the ip6erspan, 'collect_md_tun_erspan'. As a result, a couple
of places need to refactor/split up in order to distinguish ip6gre
and ip6erspan.
First, move the collect_md check at ip6gre_tunnel_{unlink,link} and
create separate function {ip6gre,ip6ersapn}_tunnel_{link_md,unlink_md}.
Then before link/unlink, make sure the link_md/unlink_md is called.
Finally, a separate ndo_uninit is created for ip6erspan. Tested it
using the samples/bpf/test_tunnel_bpf.sh.
Fixes: ef7baf5e08 ("ip6_gre: add ip6 erspan collect_md mode")
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov says:
====================
Add Renesas R8A77980 GEther support
Here's a set of 3 patches against DaveM's 'net-next.git' repo. They (gradually)
add R8A77980 GEther support to the 'sh_eth' driver, starting with couple new
register bits/values introduced with this chip, and ending with adding a new
'struct sh_eth_cpu_data' instance connected to the new DT "compatible" prop
value...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Finally, add support for the DT probing of the R-Car V3H (AKA R8A77980) --
it's the only R-Car gen3 SoC having the GEther controller -- others have
only EtherAVB...
Based on the original (and large) patch by Vladimir Barinov.
Signed-off-by: Vladimir Barinov <vladimir.barinov@cogentembedded.com>
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The R-Car V3H (AKA R8A77980) GEther controller adds the DMA burst mode bit
(NBST) in EDMR and the manual tells to always set it before doing any DMA.
Based on the original (and large) patch by Vladimir Barinov.
Signed-off-by: Vladimir Barinov <vladimir.barinov@cogentembedded.com>
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
The R-Car V3H (AKA R8A77980) GEther controller adds support for the RGMII
PHY interface mode as a new value for the RMII_MII register.
Based on the original (and large) patch by Vladimir Barinov.
Signed-off-by: Vladimir Barinov <vladimir.barinov@cogentembedded.com>
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
A handful of fixes. I've been queuing them up a bit too long so the list
is longer than it otherwise would have been spread out across a few -rcs.
In general, it's a scattering of fixes across several platforms, nothing
truly serious enough to point out.
There's a slightly larger batch of them for the Davinci platforms due
to work to bring them back to life after some time, so there's a handful
of regressions, some of them going back very far, others more recent.
There's also a few patches fixing DT on Renesas platforms since they
changed some bindings without remaining backwards compatible, splitting
up describing LVDS as a proper bridge instead of having it as part of the
display unit. We could push for them to be backwards compatible with old
device trees, but it's likely to regress eventually if nobody's actually
using said compatibility.
-----BEGIN PGP SIGNATURE-----
iQJDBAABCAAtFiEElf+HevZ4QCAJmMQ+jBrnPN6EHHcFAlsAzEkPHG9sb2ZAbGl4
b20ubmV0AAoJEIwa5zzehBx34V0P/1EEUPSF6o3lplpAFHAnXGaaRyHDF07TXkEj
zjceoOPLljrynQJ23HdjRlfh2f51rWD2XjGzlScsTJ8HXYe+auMSCRBRYjwl1RVt
zAQs2+png4pPbrxw6AUJ9CTSmCUPna0dGdySEl3FfxSt7+UdonldEJr+ZvNESiW7
+jSF3twZ/hb6iOxq7xFSnh8GU0ckTm11/HUCxQ/8z4xRfGvENs66Z7cyaStkzLop
cD7wUmwe1I0HsRWkDsGUUQwu6i445edVoELWmooZByXuGWjb3Vu9xmc+yrgQTLkW
4Y3R4kx5VfDfvdN3i2z+W7ZpN47dSkAOMIbjQYl0wELdk0UPaMFTse6mDfIBmC02
dSK2FLpZYsBQX95KxQijh4jBPs+lJsekJd1qxL3ZGpSK0VF1etGhSWrkRQ0pXNmT
4VahLoEY8KBvGKZo1QJ4U2pmAIZS3oMrK9AdJANdpyN0cEiYFl1JTM9PkZfytnLU
haagJL3BJESD36vuAhhvXVWy7vuI5jXnATn9V2WH8yZVMCPh3vsPA+d9Knh3ZqXk
Vv1yZriJyX3zV6kbFoXJsOqg0TgGsyICBSpnjfuQPTtWSdSvlrUuIINFPOqE5Z3E
uFywFEkw1L8ZXxbQn8m92+VqiqeFjyhqWmK2OolQfWlDJlJrmF8ltmkeMv9EQaig
+wh8OuSw
=qPpk
-----END PGP SIGNATURE-----
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"A handful of fixes. I've been queuing them up a bit too long so the
list is longer than it otherwise would have been spread out across a
few -rcs.
In general, it's a scattering of fixes across several platforms,
nothing truly serious enough to point out.
There's a slightly larger batch of them for the Davinci platforms due
to work to bring them back to life after some time, so there's a
handful of regressions, some of them going back very far, others more
recent.
There's also a few patches fixing DT on Renesas platforms since they
changed some bindings without remaining backwards compatible,
splitting up describing LVDS as a proper bridge instead of having it
as part of the display unit.
We could push for them to be backwards compatible with old device
trees, but it's likely to regress eventually if nobody's actually
using said compatibility"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (36 commits)
ARM: davinci: board-dm646x-evm: set VPIF capture card name
ARM: davinci: board-dm646x-evm: pass correct I2C adapter id for VPIF
ARM: davinci: dm646x: fix timer interrupt generation
ARM: keystone: fix platform_domain_notifier array overrun
arm64: dts: exynos: Fix interrupt type for I2S1 device on Exynos5433
ARM: dts: imx51-zii-rdu1: fix touchscreen bindings
firmware: arm_scmi: Use after free in scmi_create_protocol_device()
ARM: dts: cygnus: fix irq type for arm global timer
Revert "ARM: dts: logicpd-som-lv: Fix pinmux controller references"
tee: check shm references are consistent in offset/size
tee: shm: fix use-after-free via temporarily dropped reference
ARM: dts: imx7s: Pass the 'fsl,sec-era' property
ARM: dts: tegra20: Revert "Fix ULPI regression on Tegra20"
ARM: dts: correct missing "compatible" entry for ti81xx SoCs
ARM: OMAP1: ams-delta: fix deferred_fiq handler
arm64: tegra: Make BCM89610 PHY interrupt as active low
ARM: davinci: fix GPIO lookup for I2C
ARM: dts: logicpd-som-lv: Fix pinmux controller references
ARM: dts: logicpd-som-lv: Fix Audio Mute
ARM: dts: logicpd-som-lv: Fix WL127x Startup Issues
...
Marvell PPv2 Header Parser sets some bits in the 'result_info' field in
each lookup iteration, to identify different packet attributes such as
DSA / VLAN tag, protocol infos, etc. This is used in further
classification stages in the controller.
It's the DSA tag detection entry that is in charge of detecting when there
is a single VLAN tag.
This commits adds the missing update of the result_info in this case.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This contains a one-line update to the device tree of the Tegra186 P3310
processor module, fixing the polarity of the PHY interrupt. Originally,
this was queued to go into v4.18, but the PHY ID matching patch has now
found its way into v4.17-rc5, which means that the PHY driver will know
how to identify the PHY on this board and try to use the interrupt. This
will unfortunately cause networking to break on P3310, hence why I think
this should go into v4.17.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEEiOrDCAFJzPfAjcif3SOs138+s6EFAlr9pZETHHRyZWRpbmdA
bnZpZGlhLmNvbQAKCRDdI6zXfz6zoYqKD/40gsUGU9dVl0i3TAK006iF+sF05IyE
S6XT/laNM9GMletWfMVuiBwr/ftXV0+iokFh1e2EsgagMkmPsD/qdkOymyQTuddu
rFD8Z12EIVNtbXoiwKNZOmAEz1ZVt4nGPDHcsxtZrg78VpeipVsVKZpF2204x72H
vzRLCJpltTfaD2F+vyIUeyhRD/OUAIjILnDjz5LWbDPDCTU/5/YrqT3/JfoSmu7l
c6qBUMMcVIzBr1HOn12OZAyIYadv8HaPzZQaPhl7jGsCCbDX6GyKmp4fsf4UHBRz
tdzhrdjkEp4EUDOCU7lgtVTAmDYFjTmm/h9/a6V3v7DpfZv/MIs9MhplhQDYITFo
vibkrpM1srEKMhF0TmApXptS7zVKnl02+Uxw+R7hxDNw+PGrtOqiLMq+QFa0CvLm
lPeLV9pomCfT+lhgdg+rtRtc6+G1jOemfpgdpc9Ezf83Jcgct1nuUz+JZ3sftyoF
wXv1WCrg2t7tRPfCkPivgz7i49zvW7DtiuWY6xvkciXF4yw801wxNlocJpDWsF8S
Bm1vlYyosJXBd6B7PZfEFQbKbXjmGi/v8ELFjilqnFt3n5izsQWQ/FKsYMa1+6Sl
A8868aAxrLr9mnd05c+7zh9r8QQhMowZI3LLRd9GIndQPbYa4ly3kq7iG7wDaM6Q
wD/jqJyfz9YlWg==
=+6cH
-----END PGP SIGNATURE-----
Merge tag 'tegra-for-4.17-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into fixes
arm64: tegra: Device tree fixes for v4.17
This contains a one-line update to the device tree of the Tegra186 P3310
processor module, fixing the polarity of the PHY interrupt. Originally,
this was queued to go into v4.18, but the PHY ID matching patch has now
found its way into v4.17-rc5, which means that the PHY driver will know
how to identify the PHY on this board and try to use the interrupt. This
will unfortunately cause networking to break on P3310, hence why I think
this should go into v4.17.
* tag 'tegra-for-4.17-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
arm64: tegra: Make BCM89610 PHY interrupt as active low
Signed-off-by: Olof Johansson <olof@lixom.net>
Jiri Pirko says:
====================
devlink: introduce port flavours and common phys_port_name generation
This patchset resolves 2 issues we have right now:
1) There are many netdevices / ports in the system, for port, pf, vf
represenatation but the user has no way to see which is which
2) The ndo_get_phys_port_name is implemented in each driver separatelly,
which may lead to inconsistent names between drivers.
This patchset introduces port flavours which should address the first
problem. In this initial patchset, I focus on DSA and their port
flavours. As a follow-up, I plan to add PF and VF representor flavours.
However, that needs additional dependencies in drivers (nfp, mlx5).
The common phys_port_name generation is used by mlxsw. An example output
for mlxsw looks like this:
...
pci/0000:03:00.0/59: type eth netdev enp3s0np4 flavour physical number 4
pci/0000:03:00.0/61: type eth netdev enp3s0np1 flavour physical number 1
pci/0000:03:00.0/63: type eth netdev enp3s0np2 flavour physical number 2
pci/0000:03:00.0/49: type eth netdev enp3s0np8s0 flavour physical number 8 split_group 8 subport 0
pci/0000:03:00.0/50: type eth netdev enp3s0np8s1 flavour physical number 8 split_group 8 subport 1
pci/0000:03:00.0/51: type eth netdev enp3s0np8s2 flavour physical number 8 split_group 8 subport 2
pci/0000:03:00.0/52: type eth netdev enp3s0np8s3 flavour physical number 8 split_group 8 subport 3
As you can see, the netdev names are generated according to the flavour
and port number. In case the port is split, the split subnumber is also
included.
An example output for dsa_loop testing module looks like this:
mdio_bus/fixed-0:1f/0: type eth netdev lan1 flavour physical number 0
mdio_bus/fixed-0:1f/1: type eth netdev lan2 flavour physical number 1
mdio_bus/fixed-0:1f/2: type eth netdev lan3 flavour physical number 2
mdio_bus/fixed-0:1f/3: type eth netdev lan4 flavour physical number 3
mdio_bus/fixed-0:1f/4: type notset
mdio_bus/fixed-0:1f/5: type notset flavour cpu number 5
mdio_bus/fixed-0:1f/6: type notset
mdio_bus/fixed-0:1f/7: type notset
mdio_bus/fixed-0:1f/8: type notset
mdio_bus/fixed-0:1f/9: type notset
mdio_bus/fixed-0:1f/10: type notset
mdio_bus/fixed-0:1f/11: type notset
---
RFC->v1:
-removed nfp patches, removed DSA patch that used name generation helper
-patch 1:
- Reduced the nfp change just to simply use newly created attr_set func
-patch 2:
- rebased
- removed pf/vf reps flavours
-patch 3:
- rebased
-patch 4:
- added missing break pointed out by Andrew
====================
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since devlink knows the info needed to generate the physical port name
in a generic way for all devlink users, use the helper to do the job.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Set the attrs and allow to expose port flavour to user via devlink.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Each driver implements physical port name generation by itself. However
as devlink has all needed info, it can easily do the job for all its
users. So implement this helper in devlink.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Devlink ports can have specific flavour according to the purpose of use.
This patch extend attrs_set so the driver can say which flavour port
has. Initial flavours are:
physical, cpu, dsa
User can query this to see right away what is the purpose of each port.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change existing setter for split port information into more generic
attrs setter. Alongside with that, allow to set port number and subport
number for split ports.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7771c66457 ("signal/arm: Document conflicts with SI_USER and
SIGFPE") broke the siginfo structure for userspace triggered signals,
causing the strace testsuite to regress. Fix this by eliminating
the FPE_FIXME definition (which is at the root of the breakage) and
use FPE_FLTINV instead for the case where the hardware appears to be
reporting nonsense.
Fixes: 7771c66457 ("signal/arm: Document conflicts with SI_USER and SIGFPE")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Commit be83bbf806 ("mmap: introduce sane default mmap limits") was
introduced to catch problems in various ad-hoc character device drivers
doing mmap and getting the size limits wrong. In the process, it used
"known good" limits for the normal cases of mapping regular files and
block device drivers.
It turns out that the "s_maxbytes" limit was less "known good" than I
thought. In particular, /proc doesn't set it, but exposes one regular
file to mmap: /proc/vmcore. As a result, that file got limited to the
default MAX_INT s_maxbytes value.
This went unnoticed for a while, because apparently the only thing that
needs it is the s390 kernel zfcpdump, but there might be other tools
that use this too.
Vasily suggested just changing s_maxbytes for all of /proc, which isn't
wrong, but makes me nervous at this stage. So instead, just make the
new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
affect anything else. It wasn't the regular file case I was worried
about.
I'd really prefer for maxsize to have been per-inode, but that is not
how things are today.
Fixes: be83bbf806 ("mmap: introduce sane default mmap limits")
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... into a global, two-dimensional array and service subsequent reads from
that cache to avoid rdmsr_on_cpu() calls during CPU hotplug (IPIs with IRQs
disabled).
In addition, this fixes a KASAN slab-out-of-bounds read due to wrong usage
of the bank->blocks pointer.
Fixes: 27bd595027 ("x86/mce/AMD: Get address from already initialized block")
Reported-by: Johannes Hirte <johannes.hirte@datenkhaos.de>
Tested-by: Johannes Hirte <johannes.hirte@datenkhaos.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>
Link: http://lkml.kernel.org/r/20180414004230.GA2033@probook
Since do_undefinstr() uses get_user to get the undefined
instruction, it can be called before kprobes processes
recursive check. This can cause an infinit recursive
exception.
Prohibit probing on get_user functions.
Fixes: 24ba613c9d ("ARM kprobes: core code")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Prohibit kprobes on do_undefinstr because kprobes on
arm is implemented by undefined instruction. This means
if we probe do_undefinstr(), it can cause infinit
recursive exception.
Fixes: 24ba613c9d ("ARM kprobes: core code")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Prohibit probing on optimized_callback() because
it is called from kprobes itself. If we put a kprobes
on it, that will cause a recursive call loop.
Mark it NOKPROBE_SYMBOL.
Fixes: 0dc016dbd8 ("ARM: kprobes: enable OPTPROBES for ARM 32")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Since get_kprobe_ctlblk() uses smp_processor_id() to access
per-cpu variable, it hits smp_processor_id sanity check as below.
[ 7.006928] BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1
[ 7.007859] caller is debug_smp_processor_id+0x20/0x24
[ 7.008438] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.16.0-rc1-00192-g4eb17253e4b5 #1
[ 7.008890] Hardware name: Generic DT based system
[ 7.009917] [<c0313f0c>] (unwind_backtrace) from [<c030e6d8>] (show_stack+0x20/0x24)
[ 7.010473] [<c030e6d8>] (show_stack) from [<c0c64694>] (dump_stack+0x84/0x98)
[ 7.010990] [<c0c64694>] (dump_stack) from [<c071ca5c>] (check_preemption_disabled+0x138/0x13c)
[ 7.011592] [<c071ca5c>] (check_preemption_disabled) from [<c071ca80>] (debug_smp_processor_id+0x20/0x24)
[ 7.012214] [<c071ca80>] (debug_smp_processor_id) from [<c03335e0>] (optimized_callback+0x2c/0xe4)
[ 7.013077] [<c03335e0>] (optimized_callback) from [<bf0021b0>] (0xbf0021b0)
To fix this issue, call get_kprobe_ctlblk() right after
irq-disabled since that disables preemption.
Fixes: 0dc016dbd8 ("ARM: kprobes: enable OPTPROBES for ARM 32")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
You can build a kernel in a cross compiling environment that doesn't
have perl in the $PATH. Commit 429f7a062e broke that for 32 bit
ARM. Fix it.
As reported by Stephen Rothwell, it appears that the symbols can be
either part of the BSS section or absolute symbols depending on the
binutils version. When they're an absolute symbol, the $(( ))
operator errors out and the build fails. Fix this as well.
Fixes: 429f7a062e ("ARM: decompressor: fix BSS size calculation")
Reported-by: Rob Landley <rob@landley.net>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Rob Landley <rob@landley.net>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
How we got to machine_crash_nonpanic_core() (iow, from an IPI, etc) is
not interesting for debugging a crash. The more interesting context
is the parent context prior to the IPI being received.
Record the parent context register state rather than the register state
in machine_crash_nonpanic_core(), which is more relevant to the failing
condition.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
When a panic() occurs, the kexec code uses smp_send_stop() to stop
the other CPUs, but this results in the CPU register state not being
saved, and gdb is unable to inspect the state of other CPUs.
Commit 0ee59413c9 ("x86/panic: replace smp_send_stop() with kdump
friendly version in panic path") addressed the issue on x86, but
ignored other architectures. Address the issue on ARM by splitting
out the crash stop implementation to crash_smp_send_stop() and
adding the necessary protection.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
The hypervisor setup before __enter_kernel destroys the value
sotred in r1. The value needs to be restored just before the jump.
Fixes: 6b52f7bdb8 ("ARM: hyp-stub: Use r1 for the soft-restart address")
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
In commit 639da5ee37 ("ARM: add an extra temp register to the low
level debugging addruart macro") an additional temporary register was
added to the addruart macro, but the decompressor code wasn't updated.
Fixes: 639da5ee37 ("ARM: add an extra temp register to the low level debugging addruart macro")
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
The x86 mmap() code selects the mmap base for an allocation depending on
the bitness of the syscall. For 64bit sycalls it select mm->mmap_base and
for 32bit mm->mmap_compat_base.
exec() calls mmap() which in turn uses in_compat_syscall() to check whether
the mapping is for a 32bit or a 64bit task. The decision is made on the
following criteria:
ia32 child->thread.status & TS_COMPAT
x32 child->pt_regs.orig_ax & __X32_SYSCALL_BIT
ia64 !ia32 && !x32
__set_personality_x32() was dropping TS_COMPAT flag, but
set_personality_64bit() has kept compat syscall flag making
in_compat_syscall() return true during the first exec() syscall.
Which in result has user-visible effects, mentioned by Alexey:
1) It breaks ASAN
$ gcc -fsanitize=address wrap.c -o wrap-asan
$ ./wrap32 ./wrap-asan true
==1217==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
==1217==ASan shadow was supposed to be located in the [0x00007fff7000-0x10007fff7fff] range.
==1217==Process memory map follows:
0x000000400000-0x000000401000 /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
0x000000600000-0x000000601000 /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
0x000000601000-0x000000602000 /home/izbyshev/test/gcc/asan-exec-from-32bit/wrap-asan
0x0000f7dbd000-0x0000f7de2000 /lib64/ld-2.27.so
0x0000f7fe2000-0x0000f7fe3000 /lib64/ld-2.27.so
0x0000f7fe3000-0x0000f7fe4000 /lib64/ld-2.27.so
0x0000f7fe4000-0x0000f7fe5000
0x7fed9abff000-0x7fed9af54000
0x7fed9af54000-0x7fed9af6b000 /lib64/libgcc_s.so.1
[snip]
2) It doesn't seem to be great for security if an attacker always knows
that ld.so is going to be mapped into the first 4GB in this case
(the same thing happens for PIEs as well).
The testcase:
$ cat wrap.c
int main(int argc, char *argv[]) {
execvp(argv[1], &argv[1]);
return 127;
}
$ gcc wrap.c -o wrap
$ LD_SHOW_AUXV=1 ./wrap ./wrap true |& grep AT_BASE
AT_BASE: 0x7f63b8309000
AT_BASE: 0x7faec143c000
AT_BASE: 0x7fbdb25fa000
$ gcc -m32 wrap.c -o wrap32
$ LD_SHOW_AUXV=1 ./wrap32 ./wrap true |& grep AT_BASE
AT_BASE: 0xf7eff000
AT_BASE: 0xf7cee000
AT_BASE: 0x7f8b9774e000
Fixes: 1b028f784e ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()")
Fixes: ada26481df ("x86/mm: Make in_compat_syscall() work during exec")
Reported-by: Alexey Izbyshev <izbyshev@ispras.ru>
Bisected-by: Alexander Monakov <amonakov@ispras.ru>
Investigated-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Alexander Monakov <amonakov@ispras.ru>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20180517233510.24996-1-dima@arista.com
With the following commit:
fd35c88b74 ("objtool: Support GCC 8 switch tables")
I added a "can't find switch jump table" warning, to stop covering up
silent failures if add_switch_table() can't find anything.
That warning found yet another bug in the objtool switch table detection
logic. For cases 1 and 2 (as described in the comments of
find_switch_table()), the find_symbol_containing() check doesn't adjust
the offset for RIP-relative switch jumps.
Incidentally, this bug was already fixed for case 3 with:
6f5ec2993b ("objtool: Detect RIP-relative switch table references")
However, that commit missed the fix for cases 1 and 2.
The different cases are now starting to look more and more alike. So
fix the bug by consolidating them into a single case, by checking the
original dynamic jump instruction in the case 3 loop.
This also simplifies the code and makes it more robust against future
switch table detection issues -- of which I'm sure there will be many...
Switch table detection has been the most fragile area of objtool, by
far. I long for the day when we'll have a GCC plugin for annotating
switch tables. Linus asked me to delay such a plugin due to the
flakiness of the plugin infrastructure in older versions of GCC, so this
rickety code is what we're stuck with for now. At least the code is now
a little simpler than it was.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f400541613d45689086329432f3095119ffbc328.1526674218.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When CONFIG_RANDOMIZE_TEXT_OFFSET=y, TEXT_OFFSET is an arbitrary
multiple of PAGE_SIZE in the interval [0, 2MB).
The EFI stub does not account for the potential misalignment of
TEXT_OFFSET relative to EFI_KIMG_ALIGN, and produces a randomized
physical offset which is always a round multiple of EFI_KIMG_ALIGN.
This may result in statically allocated objects whose alignment exceeds
PAGE_SIZE to appear misaligned in memory. This has been observed to
result in spurious stack overflow reports and failure to make use of
the IRQ stacks, and theoretically could result in a number of other
issues.
We can OR in the low bits of TEXT_OFFSET to ensure that we have the
necessary offset (and hence preserve the misalignment of TEXT_OFFSET
relative to EFI_KIMG_ALIGN), so let's do that.
Reported-by: Kim Phillips <kim.phillips@arm.com>
Tested-by: Kim Phillips <kim.phillips@arm.com>
[ardb: clarify comment and commit log, drop unneeded parens]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Fixes: 6f26b36711 ("arm64: kaslr: increase randomization granularity")
Link: http://lkml.kernel.org/r/20180518140841.9731-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge misc fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
hfsplus: stop workqueue when fill_super() failed
mm: don't allow deferred pages with NEED_PER_CPU_KM
MAINTAINERS: add Q: entry to kselftest for patchwork project
radix tree: fix multi-order iteration race
radix tree test suite: multi-order iteration race
radix tree test suite: add item_delete_rcu()
radix tree test suite: fix compilation issue
radix tree test suite: fix mapshift build target
include/linux/mm.h: add new inline function vmf_error()
lib/test_bitmap.c: fix bitmap optimisation tests to report errors correctly
Remove the last of the "select DELL_SMBIOS" references in the Kconfig.
The following is an automated git shortlog:
DELL_WMI use depends on instead of select for DELL_SMBIOS
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJa/2FkAAoJEKbMaAwKp364LkoH/RCYsANwPSfM/4irrJ4ECJeP
SYExOC7+E4BsPEKR62j0P3aNCojxKk/oCnDGxYt1Ah0D/z2s08HHfHF6+yBLVsRQ
85bxkz/5bPcvC4KFqEcJLR395gb4iGHxvIULNLBt0F+Lj2cVYiW7b7veVKujj4Qo
o+UlozhiXhW3WLcnaMTVVeZfxMFTsaGQjWvOaSTm4T/5/v+7EHYP62LFep2cs8P0
Ldk5noe5/i488YU4FsK7LRM0AD+c8Z9E4kf6ctSDG+1gAHTlNtol2zNiOKjZ0MNE
lWW5t/PhrznGnAiFH8OOxMBzLIoNnYJClgNPXzcocD1+EDu4wTo5bsgn+JlUaJ4=
=ehCv
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v4.17-3' of git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fix from Darren Hart:
"Remove the last of the "select DELL_SMBIOS" references in the Kconfig"
* tag 'platform-drivers-x86-v4.17-3' of git://git.infradead.org/linux-platform-drivers-x86:
platform/x86: DELL_WMI use depends on instead of select for DELL_SMBIOS
stm32 clk drivers that really always need to be there when that
particular machine is compiled in and boot fix on i.MX for Stefan who
noticed odd behavior from the critical flag patch that came in during
the merge window.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCAAvFiEE9L57QeeUxqYDyoaDrQKIl8bklSUFAlr/VuERHHNib3lkQGtl
cm5lbC5vcmcACgkQrQKIl8bklSWbaw//f4DJNsqOs9t4ShG1VopHBLQ0/YKrUr1B
Qg+VPz4XPA7iGlXWEbhT39fIbLJuBNijrzWhwNpRPRawmqatG5VoqrCe4f6qzuXH
wJGI3vWL3yTlr8rsuSj9iU5z90D23sqpOvzON4CHUAHufKccyQJpsyhhxJNjiv59
bXVOYM33TUK+0DVnFr4jy1ElxgIkKj5Uak9NbuIXBjYzI3qtuk11J2zZndP8UwzM
QxD3PD28IkojekibDPVqaoVcrXr33lyWBepQYyej7Hfsj3zi7g2wbDqXltJipjJf
39wOZLM7sx+0HQK/AI0Z9yUBcLiJtYSvvPCyj38zf4HAleyCtgJLM0pRIlDhPamH
GODgSGWK0BMis+55QVUcnIHhP6u/H1ldSr/oMDFhZEp57IM2RVZar3Wypo2G96s6
RLR5y+jAMkRRswgiTMJpdpOcMnB/bPJsc4wZ5aT7pDWBm26G9rBhh3uFmTYmmGrz
JTsP5OMNyCd0rt6mt43ZUoB1t//CLU2iFBm9xd+irjjJQQGtXsPCd6yuDrWKYxel
KSVRvK/M+6VxAklrdBjMQrZ5wtvuvNqUQ3girGlhxOo9PxZvZ5h6ZO2ehbJarMRo
YhlMFvpdaGf77qPn3ot7l/+hyyL1/zquN/KaZODkEwPjD1fC4O+CfI3hk8nfbkqJ
m9i+b1j1k8o=
=rmOB
-----END PGP SIGNATURE-----
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
- a modified revert of a patch that made new choices come out for a
couple stm32 clk drivers that really always need to be there when
that particular machine is compiled in
- boot fix on i.MX for Stefan who noticed odd behavior from the
critical flag patch that came in during the merge window
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: stm32: fix: stm32 clock drivers are not compiled by default
clk: imx6ull: use OSC clock during AXI rate change
syzbot is reporting ODEBUG messages at hfsplus_fill_super() [1]. This
is because hfsplus_fill_super() forgot to call cancel_delayed_work_sync().
As far as I can see, it is hfsplus_mark_mdb_dirty() from
hfsplus_new_inode() in hfsplus_fill_super() that calls
queue_delayed_work(). Therefore, I assume that hfsplus_new_inode() does
not fail if queue_delayed_work() was called, and the out_put_hidden_dir
label is the appropriate location to call cancel_delayed_work_sync().
[1] https://syzkaller.appspot.com/bug?id=a66f45e96fdbeb76b796bf46eb25ea878c42a6c9
Link: http://lkml.kernel.org/r/964a8b27-cd69-357c-fe78-76b066056201@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4f2e5f086147d543ab03@syzkaller.appspotmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is unsafe to do virtual to physical translations before mm_init() is
called if struct page is needed in order to determine the memory section
number (see SECTION_IN_PAGE_FLAGS). This is because only in mm_init()
we initialize struct pages for all the allocated memory when deferred
struct pages are used.
My recent fix in commit c9e97a1997 ("mm: initialize pages on demand
during boot") exposed this problem, because it greatly reduced number of
pages that are initialized before mm_init(), but the problem existed
even before my fix, as Fengguang Wu found.
Below is a more detailed explanation of the problem.
We initialize struct pages in four places:
1. Early in boot a small set of struct pages is initialized to fill the
first section, and lower zones.
2. During mm_init() we initialize "struct pages" for all the memory that
is allocated, i.e reserved in memblock.
3. Using on-demand logic when pages are allocated after mm_init call
(when memblock is finished)
4. After smp_init() when the rest free deferred pages are initialized.
The problem occurs if we try to do va to phys translation of a memory
between steps 1 and 2. Because we have not yet initialized struct pages
for all the reserved pages, it is inherently unsafe to do va to phys if
the translation itself requires access of "struct page" as in case of
this combination: CONFIG_SPARSE && !CONFIG_SPARSE_VMEMMAP
The following path exposes the problem:
start_kernel()
trap_init()
setup_cpu_entry_areas()
setup_cpu_entry_area(cpu)
get_cpu_gdt_paddr(cpu)
per_cpu_ptr_to_phys(addr)
pcpu_addr_to_page(addr)
virt_to_page(addr)
pfn_to_page(__pa(addr) >> PAGE_SHIFT)
We disable this path by not allowing NEED_PER_CPU_KM with deferred
struct pages feature.
The problems are discussed in these threads:
http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.comhttp://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.comhttp://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180515175124.1770-1-pasha.tatashin@oracle.com
Fixes: 3a80a7fa79 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A new patchwork project is created to track kselftest patches. Update
the kselftest entry in the MAINTAINERS file adding 'Q:' entry:
https://patchwork.kernel.org/project/linux-kselftest/list/
Link: http://lkml.kernel.org/r/20180515164427.12201-1-shuah@kernel.org
Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a race in the multi-order iteration code which causes the kernel to
hit a GP fault. This was first seen with a production v4.15 based
kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used
order 9 PMD DAX entries.
The race has to do with how we tear down multi-order sibling entries
when we are removing an item from the tree. Remember for example that
an order 2 entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the
three slots following 'entry' contain sibling pointers which point back
to 'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in
the tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot.
Normally this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at
'entry'.
We fix this race by fixing the way that skip_siblings() detects sibling
nodes. Instead of testing against the preceding slot we instead look
for siblings via is_sibling_entry() which compares against the position
of the struct radix_tree_node.slots[] array. This ensures that sibling
entries are properly identified, even if they are no longer contiguous
with the 'entry' they point to.
Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com
Fixes: 148deab223 ("radix-tree: improve multiorder iterators")
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: CR, Sapthagirish <sapthagirish.cr@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a test which shows a race in the multi-order iteration code. This
test reliably hits the race in under a second on my machine, and is the
result of a real bug report against kernel a production v4.15 based
kernel (4.15.6-300.fc27.x86_64). With a real kernel this issue is hit
when using order 9 PMD DAX radix tree entries.
The race has to do with how we tear down multi-order sibling entries
when we are removing an item from the tree. Remember that an order 2
entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the
three slots following 'entry' contain sibling pointers which point back
to 'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in
the tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot.
Normally this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at
'entry'.
In the radix tree test suite this will be caught by the address
sanitizer:
==27063==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60c0008ae400 at pc 0x00000040ce4f bp 0x7fa89b8fcad0 sp 0x7fa89b8fcac0
READ of size 8 at 0x60c0008ae400 thread T3
#0 0x40ce4e in __radix_tree_next_slot /home/rzwisler/project/linux/tools/testing/radix-tree/radix-tree.c:1660
#1 0x4022cc in radix_tree_next_slot linux/../../../../include/linux/radix-tree.h:567
#2 0x4022cc in iterator_func /home/rzwisler/project/linux/tools/testing/radix-tree/multiorder.c:655
#3 0x7fa8a088d50a in start_thread (/lib64/libpthread.so.0+0x750a)
#4 0x7fa8a03bd16e in clone (/lib64/libc.so.6+0xf516e)
Link: http://lkml.kernel.org/r/20180503192430.7582-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the lifetime of "struct item" entries in the radix tree are
not controlled by RCU, but are instead deleted inline as they are
removed from the tree.
In the following patches we add a test which has threads iterating over
items pulled from the tree and verifying them in an
rcu_read_lock()/rcu_read_unlock() section. This means that though an
item has been removed from the tree it could still be being worked on by
other threads until the RCU grace period expires. So, we need to
actually free the "struct item" structures at the end of the grace
period, just as we do with "struct radix_tree_node" items.
Link: http://lkml.kernel.org/r/20180503192430.7582-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pulled from a patch from Matthew Wilcox entitled "xarray: Add definition
of struct xarray":
> From: Matthew Wilcox <mawilcox@microsoft.com>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
https://patchwork.kernel.org/patch/10341249/
These defines fix this compilation error:
In file included from ./linux/radix-tree.h:6:0,
from ./linux/../../../../include/linux/idr.h:15,
from ./linux/idr.h:1,
from idr.c:4:
./linux/../../../../include/linux/idr.h: In function `idr_init_base':
./linux/../../../../include/linux/radix-tree.h:129:2: warning: implicit declaration of function `spin_lock_init'; did you mean `spinlock_t'? [-Wimplicit-function-declaration]
spin_lock_init(&(root)->xa_lock); \
^
./linux/../../../../include/linux/idr.h:126:2: note: in expansion of macro `INIT_RADIX_TREE'
INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
^~~~~~~~~~~~~~~
by providing a spin_lock_init() wrapper for the v4.17-rc* version of the
radix tree test suite.
Link: http://lkml.kernel.org/r/20180503192430.7582-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>