[FFmpeg-devel] [PATCH] h264pred16x16 plane sse2/ssse3 optimizations

Ronald S. Bultje rsbultje
Thu Sep 30 15:49:14 CEST 2010


Hi,

On Wed, Sep 29, 2010 at 9:17 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Wed, Sep 29, 2010 at 08:56:13PM -0400, Ronald S. Bultje wrote:
>> On Wed, Sep 29, 2010 at 8:51 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Tue, Sep 28, 2010 at 10:31:51PM -0400, Ronald S. Bultje wrote:
>> >> +    lea          r4, [r0+r2*8-1]
>> >> +    lea          r3, [r0+r2*4-1]
>> >> +    add          r4, r2
>> >> +
>> >> +%ifdef ARCH_X86_64
>> >> +%define e_reg r11
>> >> +%else
>> >> +%define e_reg r0
>> >> +%endif
>> >> +
>> >
>> > I see a lot of r0-1; maybe r0 could be decreased by 1 somewhere?
>>
>> Yes, this is actually both smaller/simpler and also faster. Changed.
>>
>> >> +    movzx     e_reg, byte [r3+r1    ]
>> >> +    movzx        r5, byte [r4+r2*2  ]
>> >> +    sub          r5, e_reg
>> >> +    shl          r5, 2
>> >> +
>> >> +    movzx     e_reg, byte [r3       ]
>> >> +    movzx        r6, byte [r4+r2    ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*4]
>> >> +    sub          r5, r6
>> >> +
>> >> +    movzx     e_reg, byte [r3+r2    ]
>> >> +    movzx        r6, byte [r4       ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*2]
>> >> +
>> >> +    movzx     e_reg, byte [r3+r2*2  ]
>> >> +    movzx        r6, byte [r4+r1    ]
>> >> +    sub          r6, e_reg
>> >> +    add          r5, r6
>> >
>> > this and the shl 2 case look like they could be merged like
>> > add+shl->lea
>>
>> Also changed.
>>
>> >> +    lea          r3, [r4+r2*4  ]
>> >> +
>> >> +    movzx     e_reg, byte [r0+r1  -1]
>> >> +    movzx        r6, byte [r3+r2*2  ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*8]
>> >> +
>> >> +    movzx     e_reg, byte [r0     -1]
>> >> +    movzx        r6, byte [r3+r2    ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*8]
>> >> +    sub          r5, r6
>> >
>> > the *7 with lea + sub can maybe be changed to an add into the *8 case and a
>> > subtract (replacing lea by add)
>> >
>> >> +    movzx     e_reg, byte [r0+r2  -1]
>> >> +    movzx        r6, byte [r3       ]
>> >> +    sub          r6, e_reg
>> >> +    lea          r5, [r5+r6*4]
>> >> +    lea          r5, [r5+r6*2]
>> >
>> > this could add into the *4 and *2 cases to replace the 2 leas by 2 adds
>> > or lea *2 into the *3 case, reducing the 2 leas to 1
>> > similar tricks may be possible elsewhere
>>
>> I didn't quite get these two, what exactly would you like me to try?
>
> a+=8*c
> a+=8*b
> a-=b
>
> to
>
> c+=b
> a+=8*c
> a-=b
>
> ----
> a+=2*b
> a+=b
> a+=2*c
> a+=4*c
>
> to
>
> b+=2*c
> a+=2*b
> a+=b
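
As a sanity check on the two rewrites quoted above, here is a minimal C
sketch (not part of the patch). It assumes a is the accumulator (r5 in the
asm) and b, c are two byte differences in the range -255..255; all function
names are made up for illustration. The first pair should both compute
a + 7*b + 8*c, the second pair a + 3*b + 6*c.

```c
#include <assert.h>

/* First rewrite, original form: a += 8*c; a += 8*b; a -= b. */
int sum7_8_direct(int a, int b, int c)
{
    a += 8 * c;
    a += 8 * b;
    a -= b;
    return a;               /* a + 7*b + 8*c */
}

/* First rewrite, folded form: one scaled add is replaced by c += b. */
int sum7_8_folded(int a, int b, int c)
{
    c += b;
    a += 8 * c;
    a -= b;
    return a;               /* a + 8*(b + c) - b */
}

/* Second rewrite, original form: four scaled adds. */
int sum3_6_direct(int a, int b, int c)
{
    a += 2 * b;
    a += b;
    a += 2 * c;
    a += 4 * c;
    return a;               /* a + 3*b + 6*c */
}

/* Second rewrite, folded form: b += 2*c, then scale b by 3. */
int sum3_6_folded(int a, int b, int c)
{
    b += 2 * c;
    a += 2 * b;
    a += b;
    return a;               /* a + 3*(b + 2*c) */
}

/* Compare both forms over every pair of byte differences.
 * Returns 1 if they agree everywhere, 0 otherwise. */
int rewrites_agree(void)
{
    for (int b = -255; b <= 255; b++) {
        for (int c = -255; c <= 255; c++) {
            if (sum7_8_direct(100, b, c) != sum7_8_folded(100, b, c))
                return 0;
            if (sum3_6_direct(100, b, c) != sum3_6_folded(100, b, c))
                return 0;
        }
    }
    return 1;
}
```

Both folded forms modify one of the differences (c or b) in place, which
should be fine here since r6 is reloaded for every pair of pixels.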

I see, OK, will work on that. In the meantime, I wrote mmx/mmx2 and
8x8 versions (409->139 cycles for U+V on ssse3 x86-64 cathedral
sample). Attached so I have a backup somewhere in case stuff breaks.
I'll work on the above next. Feel free to not review the current patch
until I've fixed the above.

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: h264pred_pred16x16planecompat_simd.patch
Type: application/octet-stream
Size: 19121 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100930/b881ec5e/attachment-0001.obj>


