[FFmpeg-devel] [PATCH] Some ARM VFP optimizations (vector_fmul, vector_fmul_reverse, float_to_int16)

Sun May 11 17:34:37 CEST 2008

On Sun, May 11, 2008 at 04:50:56PM +0300, Siarhei Siamashka wrote:
> On Monday 21 April 2008, Michael Niedermayer wrote:
> > > So
> > > one fmuls* instruction queues 4 multiplies, which get performed one after
> > > another in arithmetic pipeline (occupying it for 4 cycles).
> >
> > That's what I missed; I expected it to do them in parallel, like a real CPU
> > :)
> 
> Even 'real' CPUs had problems with 128-bit SSE throughput (fixed in Core2)
> because of not having enough execution units, and here we are talking about
> the embedded core which has a much smaller number of transistors :)
> 
> This approach of scheduling work across pipelines is good enough if used
> properly. Decoder throughput is 1 instruction per cycle, but multi-cycle
> instructions can run simultaneously in different pipelines, overlapping each
> other, provided that they have no resource conflicts (the optimization manual
> describes register locking rules, which also need to be taken into account
> carefully). If each instruction executed in only one cycle, there would be
> no way to get parallel execution, because the decoder would be the
> bottleneck.
> 
> Even without VFP, the ARM11 core is able to perform load/store and
> arithmetic operations simultaneously (LDM/STM, the load and store multiple
> instructions, take 1 cycle in the decoding stage and continue to execute in
> the background in parallel with other instructions). This makes it possible
> to get a throughput of more than 1 operation per cycle.
> 
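The overlap argument above can be illustrated with a toy model. This is only a
sketch under stated assumptions: the pipeline names, the 4-cycle occupancy
figures and the function name are made up for illustration, and register
locking and result latencies are ignored entirely. It is not a model of the
real ARM11/VFP11 timing.

```python
def total_cycles(program):
    """Toy in-order issue model (illustrative assumption, not real ARM11 timing).

    program: list of (pipeline_name, busy_cycles) tuples in program order.
    The decoder issues at most one instruction per cycle; an instruction
    must wait until its pipeline is free, then occupies it for busy_cycles.
    """
    free_at = {}   # pipeline name -> first cycle at which it accepts new work
    issue = 0      # earliest cycle at which the decoder can issue again
    finish = 0     # cycle at which the last piece of work completes
    for pipe, busy in program:
        start = max(issue, free_at.get(pipe, 0))
        free_at[pipe] = start + busy
        finish = max(finish, start + busy)
        issue = start + 1  # in-order decoder: one instruction per cycle
    return finish


# 16 operations as four 4-cycle instructions alternating between two pipelines
# (hypothetical stand-ins for a short-vector multiply and a load/store multiple)
overlapped = [("arith", 4), ("lsu", 4)] * 2

# The same 16 operations as sixteen single-cycle instructions
single_cycle = [("arith", 1), ("lsu", 1)] * 8

print("overlapped multi-cycle:", total_cycles(overlapped), "cycles")
print("single-cycle only:     ", total_cycles(single_cycle), "cycles")
```

In this model the single-cycle version is limited by the decoder, while the
multi-cycle version lets the two pipelines work at the same time.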
> I'm probably not very good at explaining things, so anyone interested in
> optimizing code for ARM is encouraged to read ARM manuals instead.
> 
> >
> > [...]
> >
> > > So the optimization manual from ARM provides only a simplified model
> > > and can't guarantee exact results. I also tried removing all the
> > > multiplication instructions, keeping only the load/store operations, and
> > > the performance remained exactly the same (even though calculating
> > > cycles for load/store operations should supposedly be trivial). The final
> > > code is the result of some 'genetic' variations and taking the fastest
> > > version :)
> > >
> > > Oprofile shows that we get a lot of 'LSU_STALL' events, whatever that
> > > means. So it probably has something to do with some data cache throughput
> > > limitation which is not mentioned in the manual.
> >
> > google says:
> > LSU_STALL : cycles stalled because Load Store request queue \
> > is full
> 
> Well, the question actually was why I couldn't reach expected performance on
> the theoretically perfectly scheduled code, with LSU_STALL events being the
> only hardware performance counter indicating problems.
> 
> It looks like I found the answer. There is the following code in the Linux
> kernel in 'arch/arm/mm/proc-v6.S':
> 
> 	/* Workaround for the 364296 ARM1136 r0pX errata (possible cache data
> 	 * corruption with hit-under-miss enabled). The conditional code below
> 	 * (setting the undocumented bit 31 in the auxiliary control register
> 	 * and the FI bit in the control register) disables hit-under-miss
> 	 * without putting the processor into full low interrupt latency mode.
> 	 */
> 	ldr	r6, =0x4107b360			@ id for ARM1136 r0pX
> 	mrc	p15, 0, r5, c0, c0, 0		@ get processor id
> 	bic	r5, r5, #0xf			@ mask out part bits [3:0]
> 	teq	r5, r6				@ check for the faulty core
> 	mrceq	p15, 0, r5, c1, c0, 1		@ load aux control reg
> 	orreq	r5, r5, #(1 << 31)		@ set the undocumented bit 31
> 	mcreq	p15, 0, r5, c1, c0, 1		@ write aux control reg
> 	orreq	r0, r0, #(1 << 21)		@ low interrupt latency configuration
>  
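For readers who do not read ARM assembly, the condition in the kernel excerpt
above can be restated as a short sketch. The function names and the sample ID
values below are hypothetical and chosen only for illustration. The constant
and the bit positions are taken from the quoted code. Real code would read and
write the CP15 registers, which this sketch does not attempt.

```python
ARM1136_R0PX_ID = 0x4107B360  # constant loaded into r6 in the quoted kernel code


def is_arm1136_r0px(main_id):
    """Mirror 'bic r5, r5, #0xf' followed by 'teq r5, r6'."""
    return (main_id & 0xFFFFFFF0) == ARM1136_R0PX_ID


def apply_workaround(main_id, aux_ctrl, ctrl):
    """Return (aux_ctrl, ctrl) as the conditional instructions would leave them.

    aux_ctrl stands for the auxiliary control register value, ctrl for the
    control register value held in r0 in the quoted code.
    """
    if is_arm1136_r0px(main_id):
        aux_ctrl |= 1 << 31  # the undocumented bit 31
        ctrl |= 1 << 21      # the FI (low interrupt latency) bit
    return aux_ctrl & 0xFFFFFFFF, ctrl & 0xFFFFFFFF


# Made-up ID values: the first two differ only in bits [3:0], the third
# differs in a higher bit and therefore does not match.
for sample_id in (0x4107B360, 0x4107B362, 0x4117B362):
    aux, ctrl = apply_workaround(sample_id, 0, 0)
    print(f"id={sample_id:#010x} match={is_arm1136_r0px(sample_id)} "
          f"aux={aux:#010x} ctrl={ctrl:#010x}")
```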
> Unfortunately, both the Nokia N800 and the Nokia N810 use an ARM1136 core
> revision which needs this workaround. The workaround was not applied in older
> versions of the Nokia Internet Tablets firmware (OS2007), so I was able to
> track it down while looking for the reason why cache prefetch (the PLD
> instruction) stopped working in OS2008. I already mentioned this cache
> prefetch issue earlier, though I did not know the cause at the time:
> http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2008-April/045931.html
> 
> Commenting out this errata workaround in the kernel not only makes prefetch
> work again, but also improves the performance of these VFP optimized
> functions (it looks like LSU functionality got crippled by this workaround,
> which also affected our code). Now the benchmark gives the following results:
> 
> ./test-vfp 400
> Function: 'vector_fmul_vfp', time=9.122 (cycles/element=1.782)
> Function: 'vector_fmul_reverse_vfp', time=15.967 (cycles/element=3.119)
> Function: 'float_to_int16_vfp', time=19.220 (cycles/element=3.754)
> Function: 'ff_float_to_int16_c', time=73.718 (cycles/element=14.398)
> 
> The older results (with the errata workaround applied) looked like this:
> 
> ./test-vfp 400
> Function: 'vector_fmul_vfp', time=9.792 (cycles/element=1.912)
> Function: 'vector_fmul_reverse_vfp', time=16.594 (cycles/element=3.241)
> Function: 'float_to_int16_vfp', time=23.162 (cycles/element=4.524)
> Function: 'ff_float_to_int16_c', time=89.421 (cycles/element=17.465)
> 
> At least the results of 'vector_fmul_vfp' and 'float_to_int16_vfp' are now
> much closer to the theoretical, unreachable maximum throughput (1.5 cycles
> per element for 'vector_fmul_vfp' and 3 cycles per element for
> 'float_to_int16_vfp'). By a manual cycle count, the code should have taken
> 1.5625 cycles for 'vector_fmul_vfp' and 3.375 for 'float_to_int16_vfp'.
> The function 'vector_fmul_reverse_vfp' does not schedule instructions well
> enough to use both pipelines in the most efficient way; it was simply the
> fastest code that I could get in practice with the errata workaround active.
> 
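To put the two benchmark runs side by side, the arithmetic can be sketched as
follows. The numbers are copied from the output quoted above. The variable and
function names are illustrative assumptions, and the script only does
arithmetic on the reported figures without measuring anything itself.

```python
# cycles/element: (errata workaround disabled, errata workaround applied)
results = {
    "vector_fmul_vfp": (1.782, 1.912),
    "vector_fmul_reverse_vfp": (3.119, 3.241),
    "float_to_int16_vfp": (3.754, 4.524),
    "ff_float_to_int16_c": (14.398, 17.465),
}

# Manual cycle counts quoted in the mail (only given for two functions)
manual_count = {"vector_fmul_vfp": 1.5625, "float_to_int16_vfp": 3.375}


def slowdown_percent(without_fix, with_fix):
    """Extra cycles per element caused by the workaround, in percent."""
    return (with_fix / without_fix - 1.0) * 100.0


for name, (without_fix, with_fix) in results.items():
    line = f"{name}: workaround costs {slowdown_percent(without_fix, with_fix):.1f}%"
    if name in manual_count:
        ratio = without_fix / manual_count[name]
        line += f", measured/manual count = {ratio:.2f}"
    print(line)

# Speedup of the VFP version over the C version, workaround disabled
vfp_speedup = results["ff_float_to_int16_c"][0] / results["float_to_int16_vfp"][0]
print(f"float_to_int16: VFP is {vfp_speedup:.2f}x faster than C")
```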
> Attached is the latest revision of VFP optimizations patch with the following
> cosmetic updates:
> - removed information about the performance of functions from comments (as it
> is not reliable and depends on the presence or absence of errata fix)
> - renamed inline assembly named operands for 'vector_fmul_vfp' as Mans
> did not like the old names
> 
> If you still see some problems with this patch, please provide a numbered list
> of your requirements/wishes so that I can easily distinguish the important
> stuff from just random questions, comments or flame. Thanks in advance.

patch looks ok

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I wish the Xiph folks would stop pretending they've got something they
do not.  Somehow I fear this will remain a wish. -- Måns Rullgård