[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)

Siarhei Siamashka siarhei.siamashka
Sun May 11 15:50:56 CEST 2008

On Monday 21 April 2008, Michael Niedermayer wrote:
> > So
> > one fmuls* instruction queues 4 multiplies, which get performed one after
> > another in arithmetic pipeline (occupying it for 4 cycles).
> Thats what i missed, i expected it to do them in parallel, like a real CPU
> :)

Even 'real' CPUs had trouble with 128-bit SSE throughput (fixed in Core2)
because they did not have enough execution units, and here we are talking about
an embedded core with a much smaller transistor budget :)

This approach of scheduling work across the pipelines is good enough when used
properly. Decoder throughput is one instruction per cycle, but multi-cycle
instructions can run simultaneously in different pipelines, overlapping each
other, provided that they have no resource conflicts (the optimization manual
describes some register locking rules, which also need to be taken into
account carefully). If every instruction executed in just one cycle, there
would be no way to get parallel execution, because the decoder would be the
bottleneck.

Even without VFP, the ARM11 core is able to perform simultaneous load/store and
arithmetic operations (LDM/STM load and store multiple instructions take
one cycle in the decode stage and continue to execute in the background in
parallel with other instructions). This makes it possible to exceed one
operation per cycle of throughput.
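
The effect of this overlap can be sketched with a toy in-order issue model
(just an illustration, not a cycle-accurate ARM11 model; the 4-cycle FP and
2-cycle LSU occupancy numbers are assumed purely for the example):

```python
def run(seq, busy):
    """Toy in-order issue model: the decoder issues at most one
    instruction per cycle, but each issued instruction then occupies
    its execution unit in the background for several more cycles."""
    unit_free = {}
    t = 0  # current issue cycle
    for unit in seq:
        t = max(t, unit_free.get(unit, 0))  # stall while the unit is busy
        unit_free[unit] = t + busy[unit]    # unit keeps working after issue
        t += 1                              # decoder moves on next cycle
    return max([t] + list(unit_free.values()))

busy = {"fp": 4, "lsu": 2}  # assumed occupancies, for illustration only

# Four fmuls back to back: the FP pipeline is the bottleneck.
print(run(["fp"] * 4, busy))         # 16 cycles
# Interleaving four load/store multiples with them costs nothing extra:
print(run(["fp", "lsu"] * 4, busy))  # still 16 cycles
```

The interleaved loads effectively come for free, which is why single-cycle
instructions alone could never exceed one operation per cycle here.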

I'm probably not very good at explaining things, so anyone interested in
optimizing code for ARM is encouraged to read ARM manuals instead.

> [...]
> > So the optimization manual from ARM provides only some simplified model
> > and can't guarantee exact results. I also tried to remove all the
> > multiplication instructions, keeping load/store operations only, the
> > performance remained exactly the same (while supposedly calculating
> > cycles for load/store operations should be trivial). The final code is a
> > result of some 'genetic' variations and taking the fastest version :)
> >
> > Oprofile shows that we get a lot of 'LSU_STALL' events, whatever it
> > means. So it probably has something to do with some data cache throughput
> > limitation which is not mentioned in the manual.
> google says:
> LSU_STALL : cycles stalled because Load Store request queque \
> is full

Well, the question actually was why I could not reach the expected performance
with theoretically perfectly scheduled code, LSU_STALL events being the only
hardware performance counter indicating problems.

Looks like I found the answer. The Linux kernel contains the following code in
'arch/arm/mm/proc-v6.S':

	/* Workaround for the 364296 ARM1136 r0pX errata (possible cache data
	 * corruption with hit-under-miss enabled). The conditional code below
	 * (setting the undocumented bit 31 in the auxiliary control register
	 * and the FI bit in the control register) disables hit-under-miss
	 * without putting the processor into full low interrupt latency mode. */
	ldr	r6, =0x4107b360			@ id for ARM1136 r0pX
	mrc	p15, 0, r5, c0, c0, 0		@ get processor id
	bic	r5, r5, #0xf			@ mask out part bits [3:0]
	teq	r5, r6				@ check for the faulty core
	mrceq	p15, 0, r5, c1, c0, 1		@ load aux control reg
	orreq	r5, r5, #(1 << 31)		@ set the undocumented bit 31
	mcreq	p15, 0, r5, c1, c0, 1		@ write aux control reg
	orreq	r0, r0, #(1 << 21)		@ low interrupt latency configuration
Unfortunately, both the Nokia N800 and the Nokia N810 use an ARM1136 core
revision which needs this workaround. The workaround was not applied in older
versions of the Nokia Internet Tablets firmware (OS2007), so I was able to
track it down while looking for the reason why cache prefetch (the PLD
instruction) stopped working in OS2008. I already mentioned this cache
prefetch issue earlier, though I did not know the cause at the time.

Commenting out this errata workaround in the kernel not only makes prefetch
work again, but also improves the performance of these VFP optimized functions
(it looks like LSU functionality got crippled by this workaround, and that
also affected our code). Now the benchmark provides the following results:

./test-vfp 400
Function: 'vector_fmul_vfp', time=9.122 (cycles/element=1.782)
Function: 'vector_fmul_reverse_vfp', time=15.967 (cycles/element=3.119)
Function: 'float_to_int16_vfp', time=19.220 (cycles/element=3.754)
Function: 'ff_float_to_int16_c', time=73.718 (cycles/element=14.398)

Older result (with errata workaround applied) looked like this:

./test-vfp 400
Function: 'vector_fmul_vfp', time=9.792 (cycles/element=1.912)
Function: 'vector_fmul_reverse_vfp', time=16.594 (cycles/element=3.241)
Function: 'float_to_int16_vfp', time=23.162 (cycles/element=4.524)
Function: 'ff_float_to_int16_c', time=89.421 (cycles/element=17.465)
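
For a quick comparison, the relative gain from disabling the workaround can be
computed from the cycles/element figures quoted in the two runs above (a
trivial sketch, no numbers beyond those already in this mail):

```python
# cycles/element without the errata workaround vs. with it active,
# taken from the two benchmark runs quoted above
fixed = {"vector_fmul_vfp": 1.782, "vector_fmul_reverse_vfp": 3.119,
         "float_to_int16_vfp": 3.754, "ff_float_to_int16_c": 14.398}
workaround = {"vector_fmul_vfp": 1.912, "vector_fmul_reverse_vfp": 3.241,
              "float_to_int16_vfp": 4.524, "ff_float_to_int16_c": 17.465}

gain = {name: (workaround[name] - fixed[name]) / workaround[name] * 100
        for name in fixed}
for name, g in gain.items():
    print(f"{name}: {g:.1f}% faster without the workaround")
```

'float_to_int16_vfp' and the C reference gain roughly 17%, the pure VFP
multiply loops only a few percent.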

At least the results for 'vector_fmul_vfp' and 'float_to_int16_vfp' are now
much closer to the theoretical, unreachable maximum throughput (1.5 cycles
per element for 'vector_fmul_vfp' and 3 cycles per element for
'float_to_int16_vfp'). Manually counting cycles predicts 1.5625 cycles per
element for 'vector_fmul_vfp' and 3.375 for 'float_to_int16_vfp'. The
function 'vector_fmul_reverse_vfp' does not make the best use of instruction
scheduling to utilize both pipelines in the most efficient way; it is just
the fastest code that I could practically get with the errata workaround
active.
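
As a sanity check, the measured figures can be compared against the manually
calculated schedule (again only using numbers quoted in this mail):

```python
# measured cycles/element vs. the manually calculated cycle counts
measured   = {"vector_fmul_vfp": 1.782, "float_to_int16_vfp": 3.754}
calculated = {"vector_fmul_vfp": 1.5625, "float_to_int16_vfp": 3.375}

overhead = {name: (measured[name] / calculated[name] - 1) * 100
            for name in measured}
for name, o in overhead.items():
    print(f"{name}: {o:.1f}% above the calculated schedule")
```

Both functions land within roughly 11-14% of the hand-counted schedule, which
seems a reasonable margin for effects the manual does not document.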

Attached is the latest revision of the VFP optimizations patch, with the
following cosmetic updates:
- removed the performance figures from the comments (they are not reliable
and depend on the presence or absence of the errata fix)
- renamed the inline assembly named operands in 'vector_fmul_vfp', as Mans
did not like the old names

If you still see some problems with this patch, please provide a numbered list
of your requirements/wishes so that I can easily distinguish the important
points from random questions, comments or flames. Thanks in advance.

PS. Doesn't this issue remind anybody of the recent problems with AMD Phenom? ;)

Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: armvfp-try3.diff
Type: text/x-diff
Size: 9745 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080511/9c1a7901/attachment.diff>
