[FFmpeg-devel] [RFC][PATCH] DSPUtilize some functions from APE decoder

Thu Jul 3 02:29:07 CEST 2008

On 7/3/08, Loren Merritt <lorenm at u.washington.edu> wrote:
> On Wed, 2 Jul 2008, Kostya wrote:
>
>> I'm not satisfied with the decoding speed of APE decoder,
>> so I've decided to finally dsputilize functions marked as such.
>
>> +static void vector_int16_add_sse(int16_t * v1, int16_t * v2, int order)
>
> sse2
>
>> +       "movdqa  (%0),   %%xmm0 \n\t"
>> +       "movdqu  (%1),   %%xmm1 \n\t"
>> +       "paddw   %%xmm1, %%xmm0 \n\t"
>
> movdqu  (%1),   %%xmm0
> paddw   (%0),   %%xmm0
>
>> +static int32_t vector_int16_scalarproduct_sse(int16_t * v1, int16_t * v2,
>> int order)
>> +{
>> +    int i;
>> +    int res = 0, *resp=&res;
>> +
>> +    asm volatile("pxor %xmm7, %xmm7 \n\t");
>> +
>> +    for(i = 0; i < order; i += 8){
>> +        asm volatile(
>> +       "movdqu   (%0),   %%xmm0 \n\t"
>> +       "movdqa   (%1),   %%xmm1 \n\t"
>> +       "pmaddwd  %%xmm1, %%xmm0 \n\t"
>> +       "movhlps  %%xmm0, %%xmm2 \n\t"
>> +
>> +       "paddd    %%xmm2, %%xmm0 \n\t"
>> +       "pshufd  $0x01, %%xmm0,%%xmm2 \n\t"
>> +       "paddd    %%xmm2, %%xmm0 \n\t"
>> +       "paddd   %%xmm0, %%xmm7 \n\t"
>> +       : "+r"(v1), "+r"(v2)
>> +       );
>> +       v1 += 8;
>> +       v2 += 8;
>> +    }
>> +    asm volatile("movd %%xmm7, (%0)\n\t" : "+r"(resp));
>> +    return res;
>> +}
>
> horizontal sum should be outside the loop
> pshuflw is faster than pshufd

A few more things.

What guarantees that these functions are called with suitably aligned
addresses and that they always process the data in blocks of 8 samples
(i.e. order % 8 == 0)?
(Of the instructions you used, movdqa requires its memory operand to be
16-byte aligned and faults otherwise; movdqu accepts any address, but it
is the slow one ;)
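
A rough sketch of the kind of check I mean. The helper names are made
up for illustration and are not part of the patch; the only assumptions
are that one pointer goes through movdqa and that 8 samples are
processed per iteration:

```c
#include <stdint.h>

/* Hypothetical helper, not from the patch:
 * nonzero if p is aligned to 16 bytes, as movdqa requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical precondition check for the SSE2 routines:
 * aligned_ptr is whichever vector is accessed with movdqa,
 * order is the number of int16_t samples. */
static int simd_args_ok(const void *aligned_ptr, int order)
{
    return is_aligned16(aligned_ptr) && order > 0 && order % 8 == 0;
}
```

Something like assert(simd_args_ok(v2, order)) at the top of the scalar
product (where v2 is the movdqa operand) would catch a bad caller in
debug builds.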

I think somewhere in the docs there is a requirement not to split
asm blocks just to do the loop in C; keeping the loop inside the asm
would definitely let you use one variable/register for the loop
instead of 2.
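
To illustrate the single-register loop, here is a minimal sketch for
the vector add. It assumes the result is stored back into v1 (the
quoted excerpt does not show the store), that v1 is 16-byte aligned,
and that order is a positive multiple of 8. The function name is
hypothetical, and the plain C branch is only there so the sketch
builds on non-SSE2 targets:

```c
#include <stdint.h>

/* Sketch only: v1[i] += v2[i] for i in [0, order).
 * Preconditions (assumed): v1 16-byte aligned, order > 0, order % 8 == 0. */
static void vector_int16_add_sketch(int16_t *v1, const int16_t *v2, int order)
{
#if defined(__GNUC__) && defined(__SSE2__)
    /* One negative byte offset counts up to zero and indexes both vectors. */
    intptr_t off = -(intptr_t)order * (intptr_t)sizeof(int16_t);

    __asm__ volatile(
        "1:                       \n\t"
        "movdqu  (%2,%0), %%xmm0  \n\t" /* load 8 samples of v2 (unaligned) */
        "paddw   (%1,%0), %%xmm0  \n\t" /* add 8 samples of v1 (aligned)    */
        "movdqa  %%xmm0, (%1,%0)  \n\t" /* store the sums back into v1      */
        "add     $16, %0          \n\t" /* advance by 16 bytes              */
        "jl      1b               \n\t" /* loop while the offset is < 0     */
        : "+r"(off)
        : "r"(v1 + order), "r"(v2 + order)
        : "xmm0", "memory", "cc"
    );
#else
    int i;
    for (i = 0; i < order; i++)
        v1[i] = (int16_t)(v1[i] + v2[i]);
#endif
}
```

Both base pointers are set to the end of their vectors, so the add that
advances the offset also sets the flags for the loop branch and no
separate counter is needed.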

I'm not sure why you use a pointer to a local variable.
There must be a way to pass the return variable directly
to the asm block as an output operand, so that if the compiler
assigns that variable to the eax register, "movd" puts the value
in eax directly and it is returned that way.
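
A minimal sketch of what I have in mind, reduced to just the final
horizontal sum of four 32-bit partial sums. The function name and the
in-memory input are assumptions for illustration (in the real function
the partial sums would already sit in an xmm register after the loop),
and the plain C branch only exists so it builds without SSE2:

```c
#include <stdint.h>

/* Sketch only: returns v[0] + v[1] + v[2] + v[3]. */
static int32_t hsum_int32x4_sketch(const int32_t v[4])
{
    int32_t res;
#if defined(__GNUC__) && defined(__SSE2__)
    __asm__ volatile(
        "movdqu  (%1), %%xmm0           \n\t" /* load the 4 partial sums     */
        "movhlps %%xmm0, %%xmm1         \n\t" /* high 64 bits -> low of xmm1 */
        "paddd   %%xmm1, %%xmm0         \n\t" /* lanes 0,1 += lanes 2,3      */
        "pshuflw $0x4E, %%xmm0, %%xmm1  \n\t" /* bring lane 1 down to lane 0 */
        "paddd   %%xmm1, %%xmm0         \n\t" /* lane 0 = total              */
        "movd    %%xmm0, %0             \n\t" /* straight into the output    */
        : "=r"(res)
        : "r"(v)
        : "xmm0", "xmm1", "memory"
    );
#else
    res = v[0] + v[1] + v[2] + v[3];
#endif
    return res;
}
```

With "=r"(res) the compiler picks the register, so movd can land
directly in whatever holds the return value, with no store to the stack
and no reload.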