[FFmpeg-devel] [PATCH] h264pred16x16 plane sse2/ssse3 optimizations

Ronald S. Bultje rsbultje
Fri Oct 1 04:08:33 CEST 2010


Hi,

On Wed, Sep 29, 2010 at 9:17 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Wed, Sep 29, 2010 at 08:56:13PM -0400, Ronald S. Bultje wrote:
>> On Wed, Sep 29, 2010 at 8:51 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Tue, Sep 28, 2010 at 10:31:51PM -0400, Ronald S. Bultje wrote:
>> >> +    lea          r4, [r0+r2*8-1]
>> >> +    lea          r3, [r0+r2*4-1]
>> >> +    add          r4, r2
>> >> +
>> >> +%ifdef ARCH_X86_64
>> >> +%define e_reg r11
>> >> +%else
>> >> +%define e_reg r0
>> >> +%endif
>> >> +
>> >
>> > I see a lot of r0-1; maybe r0 could be decreased by 1 somewhere?
>>
>> Yes, this is actually both smaller/simpler and also faster. Changed.
>>
>> >> +    movzx     e_reg, byte [r3+r1    ]
>> >> +    movzx        r5, byte [r4+r2*2  ]
>> >> +    sub          r5, e_reg
>> >> +    shl          r5, 2
>> >> +
>> >> +    movzx     e_reg, byte [r3       ]
>> >> +    movzx        r6, byte [r4+r2    ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*4]
>> >> +    sub          r5, r6
>> >> +
>> >> +    movzx     e_reg, byte [r3+r2    ]
>> >> +    movzx        r6, byte [r4       ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*2]
>> >> +
>> >> +    movzx     e_reg, byte [r3+r2*2  ]
>> >> +    movzx        r6, byte [r4+r1    ]
>> >> +    sub          r6, e_reg
>> >> +    add          r5, r6
>> >
>> > this and the shl 2 case look like they could be merged like
>> > add+shl->lea
>>
>> Also changed.
>>
>> >> +    lea          r3, [r4+r2*4  ]
>> >> +
>> >> +    movzx     e_reg, byte [r0+r1  -1]
>> >> +    movzx        r6, byte [r3+r2*2  ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*8]
>> >> +
>> >> +    movzx     e_reg, byte [r0     -1]
>> >> +    movzx        r6, byte [r3+r2    ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*8]
>> >> +    sub          r5, r6
>> >
>> > the *7 with lea + sub can maybe be changed to an add into the *8 case and a
>> > subtract (replacing lea by add)
>> >
>> >> +    movzx     e_reg, byte [r0+r2  -1]
>> >> +    movzx        r6, byte [r3       ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*4]
>> >> +    lea          r5, [r5+r6*2]
>> >
>> > this could add into the *4 and *2 cases to replace the 2 leas by 2 adds,
>> > or lea *2 into the *3 case, reducing the 2 leas to 1.
>> > Similar tricks may be possible elsewhere.
>>
>> I didn't quite get these two; what exactly would you like me to try?
>
> a+=8*c
> a+=8*b
> a-=b
>
> to
>
> c+=b
> a+=8*c
> a-=b
>
> ----
> a+=2*b
> a+=b
> a+=2*c
> a+=4*c
>
> to
>
> b+=2*c
> a+=2*b
> a+=b
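
Both rewrites quoted above are plain arithmetic identities: the first
pair evaluates to a + 7*b + 8*c, the second to a + 3*b + 6*c. A minimal
C sketch to check that follows. The function names are hypothetical and
not part of the patch; a, b and c stand in for the accumulator and the
two byte differences.

```c
#include <stdint.h>

/* First rewrite, original form: a += 8*c; a += 8*b; a -= b. */
int32_t sum_7b_8c_original(int32_t a, int32_t b, int32_t c)
{
    a += 8 * c;
    a += 8 * b;
    a -= b;
    return a;               /* a + 7*b + 8*c */
}

/* First rewrite, merged form: fold b into c, then one scaled add. */
int32_t sum_7b_8c_merged(int32_t a, int32_t b, int32_t c)
{
    c += b;
    a += 8 * c;
    a -= b;
    return a;               /* a + 8*(b + c) - b */
}

/* Second rewrite, original form: a += 2*b; a += b; a += 2*c; a += 4*c. */
int32_t sum_3b_6c_original(int32_t a, int32_t b, int32_t c)
{
    a += 2 * b;
    a += b;
    a += 2 * c;
    a += 4 * c;
    return a;               /* a + 3*b + 6*c */
}

/* Second rewrite, merged form: fold 2*c into b, then scale b by 3. */
int32_t sum_3b_6c_merged(int32_t a, int32_t b, int32_t c)
{
    b += 2 * c;
    a += 2 * b;
    a += b;
    return a;               /* a + 3*(b + 2*c) */
}
```

In each merged form one of the input differences is overwritten, so its
register has to be dead afterwards or a spare one is needed.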

OK, new patch attached. The caveat is that on x86-32 I don't think I
have enough registers (I could do it as a strictly linear sequence, but
then I'm afraid that would make it slower on Atom or so), so I only did
this on x86-64. It's a little spaghetti-like, maybe... Let me know if
that's OK or if you prefer the linear way (that would mean saving the
result of the first pair, then reusing the same register for the next
two movzx's and adding/subtracting them directly into the register that
holds the previous difference), i.e.:

movzx a, [val1a]
movzx b, [val1b]
sub a, b
sub res, a

movzx b, [val2a]
add a, b
movzx b, [val2b]
sub a, b
lea res, [res+a*4/8]
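
To make the data flow of this sequence explicit, here is a rough C
model of it, one statement per instruction. The function name and the
scale parameter are hypothetical; scale stands for the 4 or 8 in the
final lea. Assuming the sequence is read as written, the net effect is
res += (scale-1)*(val1a-val1b) + scale*(val2a-val2b).

```c
#include <stdint.h>

/* Hypothetical model of the "linear" sequence; not code from the patch. */
int32_t linear_step(int32_t res, uint8_t val1a, uint8_t val1b,
                    uint8_t val2a, uint8_t val2b, int32_t scale)
{
    int32_t a, b;

    a = val1a;          /* movzx a, [val1a]             */
    b = val1b;          /* movzx b, [val1b]             */
    a -= b;             /* sub   a, b                   */
    res -= a;           /* sub   res, a                 */

    b = val2a;          /* movzx b, [val2a]             */
    a += b;             /* add   a, b                   */
    b = val2b;          /* movzx b, [val2b]             */
    a -= b;             /* sub   a, b                   */
    res += a * scale;   /* lea   res, [res+a*scale]     */

    return res;
}
```

Only two scratch registers (a and b) are live besides res, which is the
point of this variant on x86-32; the cost is that every instruction
depends on the previous one.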

As for performance, the second suggestion saved several cycles; the
first didn't really have an effect (0.2 cycles faster, i.e. probably
noise). I also added an ALIGN 16 to the 8x8. Otherwise unchanged.
"make fate-h264" still passes on both x86-64 and x86-32 (the latter is
basically unchanged).

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264pred_pred16x16planecompat_simd.patch
Type: application/octet-stream
Size: 19722 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100930/d9dda52e/attachment.obj>


