linux

Author SHA1 Message Date

Author	SHA1	Message	Date
Arnd Bergmann	1cce91dfc8	ARM: 8715/1: add a private asm/unaligned.h The asm-generic/unaligned.h header provides two different implementations for accessing unaligned variables: the access_ok.h version used when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set pretends that all pointers are in fact aligned, while the le_struct.h version convinces gcc that the alignment of a pointer is '1', to make it issue the correct load/store instructions depending on the architecture flags. On ARMv5 and older, we always use the second version, to let the compiler use byte accesses. On ARMv6 and newer, we currently use the access_ok.h version, so the compiler can use any instruction including stm/ldm and ldrd/strd that will cause an alignment trap. This trap can significantly impact performance when we have to do a lot of fixups and, worse, has led to crashes in the LZ4 decompressor code that does not have a trap handler. This adds an ARM specific version of asm/unaligned.h that uses the le_struct.h/be_struct.h implementation unconditionally. This should lead to essentially the same code on ARMv6+ as before, with the exception of using regular load/store instructions instead of the trapping instructions multi-register variants. The crash in the LZ4 decompressor code was probably introduced by the patch replacing the LZ4 implementation, commit `4e1a33b105` ("lib: update LZ4 compressor module"), so linux-4.11 and higher would be affected most. However, we probably want to have this backported to all older stable kernels as well, to help with the performance issues. There are two follow-ups that I think we should also work on, but not backport to stable kernels, first to change the asm-generic version of the header to remove the ARM special case, and second to review all other uses of CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to see if they might be affected by the same problem on ARM. Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>	2017-10-24 10:33:23 +01:00
Rob Herring	d25c881aa3	ARM: 7493/1: use generic unaligned.h This moves ARM over to the asm-generic/unaligned.h header. This has the benefit of better code generated especially for ARMv7 on gcc 4.7+ compilers. As Arnd Bergmann, points out: The asm-generic version uses the "struct" version for native-endian unaligned access and the "byteshift" version for the opposite endianess. The current ARM version however uses the "byteshift" implementation for both. Thanks to Nicolas Pitre for the excellent analysis: Test case: int foo (int x) { return get_unaligned(x); } long long bar (long long x) { return get_unaligned(x); } With the current ARM version: foo: ldrb r3, [r0, #2] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 2B], MEM[(const u8 )x_1(D) + 2B] ldrb r1, [r0, #1] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 1B], MEM[(const u8 )x_1(D) + 1B] ldrb r2, [r0, #0] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D)], MEM[(const u8 )x_1(D)] mov r3, r3, asl #16 @ tmp154, MEM[(const u8 )x_1(D) + 2B], ldrb r0, [r0, #3] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 3B], MEM[(const u8 )x_1(D) + 3B] orr r3, r3, r1, asl #8 @, tmp155, tmp154, MEM[(const u8 )x_1(D) + 1B], orr r3, r3, r2 @ tmp157, tmp155, MEM[(const u8 )x_1(D)] orr r0, r3, r0, asl #24 @,, tmp157, MEM[(const u8 )x_1(D) + 3B], bx lr @ bar: stmfd sp!, {r4, r5, r6, r7} @, mov r2, #0 @ tmp184, ldrb r5, [r0, #6] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 6B], MEM[(const u8 )x_1(D) + 6B] ldrb r4, [r0, #5] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 5B], MEM[(const u8 )x_1(D) + 5B] ldrb ip, [r0, #2] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 2B], MEM[(const u8 )x_1(D) + 2B] ldrb r1, [r0, #4] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 4B], MEM[(const u8 )x_1(D) + 4B] mov r5, r5, asl #16 @ tmp175, MEM[(const u8 )x_1(D) + 6B], ldrb r7, [r0, #1] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 1B], MEM[(const u8 )x_1(D) + 1B] orr r5, r5, r4, asl #8 @, tmp176, tmp175, MEM[(const u8 )x_1(D) + 5B], ldrb r6, [r0, #7] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 7B], MEM[(const u8 )x_1(D) + 7B] orr r5, r5, r1 @ tmp178, tmp176, MEM[(const u8 )x_1(D) + 4B] ldrb r4, [r0, #0] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D)], MEM[(const u8 )x_1(D)] mov ip, ip, asl #16 @ tmp188, MEM[(const u8 )x_1(D) + 2B], ldrb r1, [r0, #3] @ zero_extendqisi2 @ MEM[(const u8 )x_1(D) + 3B], MEM[(const u8 )x_1(D) + 3B] orr ip, ip, r7, asl #8 @, tmp189, tmp188, MEM[(const u8 )x_1(D) + 1B], orr r3, r5, r6, asl #24 @,, tmp178, MEM[(const u8 )x_1(D) + 7B], orr ip, ip, r4 @ tmp191, tmp189, MEM[(const u8 )x_1(D)] orr ip, ip, r1, asl #24 @, tmp194, tmp191, MEM[(const u8 )x_1(D) + 3B], mov r1, r3 @, orr r0, r2, ip @ tmp171, tmp184, tmp194 ldmfd sp!, {r4, r5, r6, r7} bx lr In both cases the code is slightly suboptimal. One may wonder why wasting r2 with the constant 0 in the second case for example. And all the mov's could be folded in subsequent orr's, etc. Now with the asm-generic version: foo: ldr r0, [r0, #0] @ unaligned @,* x bx lr @ bar: mov r3, r0 @ x, x ldr r0, [r0, #0] @ unaligned @,* x ldr r1, [r3, #4] @ unaligned @, bx lr @ This is way better of course, but only because this was compiled for ARMv7. In this case the compiler knows that the hardware can do unaligned word access. This isn't that obvious for foo(), but if we remove the get_unaligned() from bar as follows: long long bar (long long x) {return x; } then the resulting code is: bar: ldmia r0, {r0, r1} @ x,, bx lr @ So this proves that the presumed aligned vs unaligned cases does have influence on the instructions the compiler may use and that the above unaligned code results are not just an accident. Still... this isn't fully conclusive without at least looking at the resulting assembly fron a pre ARMv6 compilation. Let's see with an ARMv5 target: foo: ldrb r3, [r0, #0] @ zero_extendqisi2 @ tmp139,* x ldrb r1, [r0, #1] @ zero_extendqisi2 @ tmp140, ldrb r2, [r0, #2] @ zero_extendqisi2 @ tmp143, ldrb r0, [r0, #3] @ zero_extendqisi2 @ tmp146, orr r3, r3, r1, asl #8 @, tmp142, tmp139, tmp140, orr r3, r3, r2, asl #16 @, tmp145, tmp142, tmp143, orr r0, r3, r0, asl #24 @,, tmp145, tmp146, bx lr @ bar: stmfd sp!, {r4, r5, r6, r7} @, ldrb r2, [r0, #0] @ zero_extendqisi2 @ tmp139,* x ldrb r7, [r0, #1] @ zero_extendqisi2 @ tmp140, ldrb r3, [r0, #4] @ zero_extendqisi2 @ tmp149, ldrb r6, [r0, #5] @ zero_extendqisi2 @ tmp150, ldrb r5, [r0, #2] @ zero_extendqisi2 @ tmp143, ldrb r4, [r0, #6] @ zero_extendqisi2 @ tmp153, ldrb r1, [r0, #7] @ zero_extendqisi2 @ tmp156, ldrb ip, [r0, #3] @ zero_extendqisi2 @ tmp146, orr r2, r2, r7, asl #8 @, tmp142, tmp139, tmp140, orr r3, r3, r6, asl #8 @, tmp152, tmp149, tmp150, orr r2, r2, r5, asl #16 @, tmp145, tmp142, tmp143, orr r3, r3, r4, asl #16 @, tmp155, tmp152, tmp153, orr r0, r2, ip, asl #24 @,, tmp145, tmp146, orr r1, r3, r1, asl #24 @,, tmp155, tmp156, ldmfd sp!, {r4, r5, r6, r7} bx lr Compared to the initial results, this is really nicely optimized and I couldn't do much better if I were to hand code it myself. Signed-off-by: Rob Herring <rob.herring@calxeda.com> Reviewed-by: Nicolas Pitre <nico@linaro.org> Tested-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2012-08-25 09:22:30 +01:00
Russell King	4baa992243	[ARM] move include/asm-arm to arch/arm/include/asm Move platform independent header files to arch/arm/include/asm, leaving those in asm/arch* and asm/plat* alone. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>	2008-08-02 21:32:35 +01:00

Arnd Bergmann

1cce91dfc8

ARM: 8715/1: add a private asm/unaligned.h

The asm-generic/unaligned.h header provides two different implementations
for accessing unaligned variables: the access_ok.h version used when
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set pretends that all pointers
are in fact aligned, while the le_struct.h version convinces gcc that the
alignment of a pointer is '1', to make it issue the correct load/store
instructions depending on the architecture flags.

On ARMv5 and older, we always use the second version, to let the compiler
use byte accesses. On ARMv6 and newer, we currently use the access_ok.h
version, so the compiler can use any instruction including stm/ldm and
ldrd/strd that will cause an alignment trap. This trap can significantly
impact performance when we have to do a lot of fixups and, worse, has
led to crashes in the LZ4 decompressor code that does not have a trap
handler.

This adds an ARM specific version of asm/unaligned.h that uses the
le_struct.h/be_struct.h implementation unconditionally. This should lead
to essentially the same code on ARMv6+ as before, with the exception of
using regular load/store instructions instead of the trapping instructions
multi-register variants.

The crash in the LZ4 decompressor code was probably introduced by the
patch replacing the LZ4 implementation, commit 4e1a33b105 ("lib: update
LZ4 compressor module"), so linux-4.11 and higher would be affected most.
However, we probably want to have this backported to all older stable
kernels as well, to help with the performance issues.

There are two follow-ups that I think we should also work on, but not
backport to stable kernels, first to change the asm-generic version of
the header to remove the ARM special case, and second to review all
other uses of CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to see if they
might be affected by the same problem on ARM.

Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>

2017-10-24 10:33:23 +01:00

Rob Herring

d25c881aa3

ARM: 7493/1: use generic unaligned.h

This moves ARM over to the asm-generic/unaligned.h header. This has the
benefit of better code generated especially for ARMv7 on gcc 4.7+
compilers.

As Arnd Bergmann, points out: The asm-generic version uses the "struct"
version for native-endian unaligned access and the "byteshift" version
for the opposite endianess. The current ARM version however uses the
"byteshift" implementation for both.

Thanks to Nicolas Pitre for the excellent analysis:

Test case:

int foo (int *x) { return get_unaligned(x); }
long long bar (long long *x) { return get_unaligned(x); }

With the current ARM version:

foo:
	ldrb	r3, [r0, #2]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 2B], MEM[(const u8 *)x_1(D) + 2B]
	ldrb	r1, [r0, #1]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 1B], MEM[(const u8 *)x_1(D) + 1B]
	ldrb	r2, [r0, #0]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D)], MEM[(const u8 *)x_1(D)]
	mov	r3, r3, asl #16	@ tmp154, MEM[(const u8 *)x_1(D) + 2B],
	ldrb	r0, [r0, #3]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 3B], MEM[(const u8 *)x_1(D) + 3B]
	orr	r3, r3, r1, asl #8	@, tmp155, tmp154, MEM[(const u8 *)x_1(D) + 1B],
	orr	r3, r3, r2	@ tmp157, tmp155, MEM[(const u8 *)x_1(D)]
	orr	r0, r3, r0, asl #24	@,, tmp157, MEM[(const u8 *)x_1(D) + 3B],
	bx	lr	@

bar:
	stmfd	sp!, {r4, r5, r6, r7}	@,
	mov	r2, #0	@ tmp184,
	ldrb	r5, [r0, #6]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 6B], MEM[(const u8 *)x_1(D) + 6B]
	ldrb	r4, [r0, #5]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 5B], MEM[(const u8 *)x_1(D) + 5B]
	ldrb	ip, [r0, #2]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 2B], MEM[(const u8 *)x_1(D) + 2B]
	ldrb	r1, [r0, #4]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 4B], MEM[(const u8 *)x_1(D) + 4B]
	mov	r5, r5, asl #16	@ tmp175, MEM[(const u8 *)x_1(D) + 6B],
	ldrb	r7, [r0, #1]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 1B], MEM[(const u8 *)x_1(D) + 1B]
	orr	r5, r5, r4, asl #8	@, tmp176, tmp175, MEM[(const u8 *)x_1(D) + 5B],
	ldrb	r6, [r0, #7]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 7B], MEM[(const u8 *)x_1(D) + 7B]
	orr	r5, r5, r1	@ tmp178, tmp176, MEM[(const u8 *)x_1(D) + 4B]
	ldrb	r4, [r0, #0]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D)], MEM[(const u8 *)x_1(D)]
	mov	ip, ip, asl #16	@ tmp188, MEM[(const u8 *)x_1(D) + 2B],
	ldrb	r1, [r0, #3]	@ zero_extendqisi2	@ MEM[(const u8 *)x_1(D) + 3B], MEM[(const u8 *)x_1(D) + 3B]
	orr	ip, ip, r7, asl #8	@, tmp189, tmp188, MEM[(const u8 *)x_1(D) + 1B],
	orr	r3, r5, r6, asl #24	@,, tmp178, MEM[(const u8 *)x_1(D) + 7B],
	orr	ip, ip, r4	@ tmp191, tmp189, MEM[(const u8 *)x_1(D)]
	orr	ip, ip, r1, asl #24	@, tmp194, tmp191, MEM[(const u8 *)x_1(D) + 3B],
	mov	r1, r3	@,
	orr	r0, r2, ip	@ tmp171, tmp184, tmp194
	ldmfd	sp!, {r4, r5, r6, r7}
	bx	lr

In both cases the code is slightly suboptimal.  One may wonder why
wasting r2 with the constant 0 in the second case for example.  And all
the mov's could be folded in subsequent orr's, etc.

Now with the asm-generic version:

foo:
	ldr	r0, [r0, #0]	@ unaligned	@,* x
	bx	lr	@

bar:
	mov	r3, r0	@ x, x
	ldr	r0, [r0, #0]	@ unaligned	@,* x
	ldr	r1, [r3, #4]	@ unaligned	@,
	bx	lr	@

This is way better of course, but only because this was compiled for
ARMv7. In this case the compiler knows that the hardware can do
unaligned word access.  This isn't that obvious for foo(), but if we
remove the get_unaligned() from bar as follows:

long long bar (long long *x) {return *x; }

then the resulting code is:

bar:
	ldmia	r0, {r0, r1}	@ x,,
	bx	lr	@

So this proves that the presumed aligned vs unaligned cases does have
influence on the instructions the compiler may use and that the above
unaligned code results are not just an accident.

Still... this isn't fully conclusive without at least looking at the
resulting assembly fron a pre ARMv6 compilation.  Let's see with an
ARMv5 target:

foo:
	ldrb	r3, [r0, #0]	@ zero_extendqisi2	@ tmp139,* x
	ldrb	r1, [r0, #1]	@ zero_extendqisi2	@ tmp140,
	ldrb	r2, [r0, #2]	@ zero_extendqisi2	@ tmp143,
	ldrb	r0, [r0, #3]	@ zero_extendqisi2	@ tmp146,
	orr	r3, r3, r1, asl #8	@, tmp142, tmp139, tmp140,
	orr	r3, r3, r2, asl #16	@, tmp145, tmp142, tmp143,
	orr	r0, r3, r0, asl #24	@,, tmp145, tmp146,
	bx	lr	@

bar:
	stmfd	sp!, {r4, r5, r6, r7}	@,
	ldrb	r2, [r0, #0]	@ zero_extendqisi2	@ tmp139,* x
	ldrb	r7, [r0, #1]	@ zero_extendqisi2	@ tmp140,
	ldrb	r3, [r0, #4]	@ zero_extendqisi2	@ tmp149,
	ldrb	r6, [r0, #5]	@ zero_extendqisi2	@ tmp150,
	ldrb	r5, [r0, #2]	@ zero_extendqisi2	@ tmp143,
	ldrb	r4, [r0, #6]	@ zero_extendqisi2	@ tmp153,
	ldrb	r1, [r0, #7]	@ zero_extendqisi2	@ tmp156,
	ldrb	ip, [r0, #3]	@ zero_extendqisi2	@ tmp146,
	orr	r2, r2, r7, asl #8	@, tmp142, tmp139, tmp140,
	orr	r3, r3, r6, asl #8	@, tmp152, tmp149, tmp150,
	orr	r2, r2, r5, asl #16	@, tmp145, tmp142, tmp143,
	orr	r3, r3, r4, asl #16	@, tmp155, tmp152, tmp153,
	orr	r0, r2, ip, asl #24	@,, tmp145, tmp146,
	orr	r1, r3, r1, asl #24	@,, tmp155, tmp156,
	ldmfd	sp!, {r4, r5, r6, r7}
	bx	lr

Compared to the initial results, this is really nicely optimized and I
couldn't do much better if I were to hand code it myself.

Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Reviewed-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

2012-08-25 09:22:30 +01:00

Russell King

4baa992243

[ARM] move include/asm-arm to arch/arm/include/asm

Move platform independent header files to arch/arm/include/asm, leaving
those in asm/arch* and asm/plat* alone.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>

2008-08-02 21:32:35 +01:00

3 Commits