[FFmpeg-devel] [PATCH] MMX VP3 Loop Filter

Michael Niedermayer michaelni
Sun Oct 12 11:51:18 CEST 2008


On Sat, Oct 11, 2008 at 08:40:23PM -0400, David Conrad wrote:
> On Oct 11, 2008, at 6:03 AM, Michael Niedermayer wrote:
>
>> On Sat, Oct 11, 2008 at 04:53:24AM -0400, David Conrad wrote:
>>> On Oct 8, 2008, at 1:59 AM, David Conrad wrote:
>>>
>>>> On Oct 7, 2008, at 5:43 AM, Jason Garrett-Glaser wrote:
>>>>
>>>>>> Here's an 8-bit version. However, checking for the C fallback negates the
>>>>>> small speedup on my Penryn compared to the 16-bit version.
>>>>>
>>>>> Most of the code is still 16-bit.  Are you sure this can't be done
>>>>> x264-style with emulation of extra bits and 8-bit math (reference for
>>>>> an example of how to do this: common/x86/deblock-a.asm in x264 tree)?
>>>>> This would eliminate the need for all unpacks, all packs, and all
>>>>> multiplication, and probably increase speed dramatically.  I strongly
>>>>> suspect that it can be done, as the deblocking formulas seem very
>>>>> similar to those used in H.264.
>>>>
>>>> It seems like you're right; the only difference between DEBLOCK_P0_Q0 and
>>>> VP3 is a *3 vs. a *4 in H.264.
>>>> I don't quite fully understand x264's implementation, so it'll take
>>>> another bit to adapt it.
>>>
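
The "*3 vs. *4" remark above can be illustrated with a scalar sketch. This is
only a model under stated assumptions: the function names are hypothetical, the
pixel naming (p1, p0 | q0, q1 across the edge) follows the usual H.264
convention, and only the raw delta before limiting is shown. The limiting step
(a bounding_values lookup in VP3, a clip to +/-tc in H.264) is left out.

```c
#include <stdint.h>

/* Hypothetical helper: raw VP3 loop filter delta before the
 * bounding_values lookup. The edge lies between p0 and q0.
 * Relies on arithmetic right shift for negative sums. */
static int vp3_raw_delta(int p1, int p0, int q0, int q1)
{
    return ((p1 - q1) + 3 * (q0 - p0) + 4) >> 3;
}

/* Hypothetical helper: raw H.264 p0/q0 delta before clipping to +/-tc.
 * Same shape as above, but the centre difference is weighted by 4. */
static int h264_raw_delta(int p1, int p0, int q0, int q1)
{
    return ((p1 - q1) + 4 * (q0 - p0) + 4) >> 3;
}
```

In both codecs the limited delta is then added to p0 and subtracted from q0,
with saturation to the 8-bit range.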
>>> And here's an entirely 8-bit implementation. ~3 cycles faster than the last
>>> patch I posted.
>>
>>> I'm not sure the best way to avoid the duplication of ff_pb_1/3/7
>>> constants; there aren't enough registers to pass the address of all of the
>>> constants I need.
>>
>> try MANGLE()
>
> Done.
>
>> [...]
>>> +\
>>> +    "movd     "#flim", %%mm5 \n\t" \
>>> +    "punpcklbw  %%mm5, %%mm5 \n\t" \
>>
>> you could pass the thing from mm5 at the end of the bounding_values array,
>> this also would make filter_limit unneeded, avoid the *0x02020202 and the
>> punpcklbw
>
> Done.
>
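
A minimal sketch of the idea discussed above, precomputing the byte-replicated
limit once and storing it at the end of the bounding_values array. The function
names are hypothetical, and the use of slot 130 next to slot 129 (to form a
full 64-bit constant) is an assumption; the patch as quoted only shows index
129 being read.

```c
#include <stdint.h>

/* Multiplying by 0x02020202 places 2*flim in each of the four bytes.
 * No byte carries into its neighbour as long as flim <= 127. */
static uint32_t replicate_limit(unsigned flim)
{
    return flim * 0x02020202u;
}

/* Hypothetical setup helper: store the packed limit in two adjacent
 * int slots so the asm can fetch all eight bytes with a single load.
 * Slot 130 is an assumption, not taken from the patch. */
static void store_packed_limit(int *bounding_values, unsigned flim)
{
    uint32_t packed = replicate_limit(flim);

    bounding_values[129] = (int)packed;
    bounding_values[130] = (int)packed;
}
```

With the constant already in memory, the filter no longer needs the movd and
punpcklbw pair, nor a separate filter limit argument.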

[...]
> @@ -86,6 +88,20 @@ extern const double ff_pd_2[2];
>      SBUTTERFLY(a,c,d,dq,q) /* a=aeim d=bfjn */\
>      SBUTTERFLY(t,b,c,dq,q) /* t=cgko c=dhlp */
>  
> +#define TRANSPOSE8x4(a,b,c,d,e,f,g,h)\
> +    "punpcklbw " #e ", " #a " \n\t" /* a0 e0 a1 e1 a2 e2 a3 e3 */\
> +    "punpcklbw " #f ", " #b " \n\t" /* b0 f0 b1 f1 b2 f2 b3 f3 */\
> +    "punpcklbw " #g ", " #c " \n\t" /* c0 g0 c1 g1 c2 g2 c3 g3 */\
> +    "punpcklbw " #h ", " #d " \n\t" /* d0 h0 d1 h1 d2 h2 d3 h3 */\
> +    SBUTTERFLY(a, b, e, bw, q)   /* a= a0 b0 e0 f0 a1 b1 e1 f1 */\
> +                                 /* e= a2 b2 e2 f2 a3 b3 e3 f3 */\
> +    SBUTTERFLY(c, d, b, bw, q)   /* c= c0 d0 g0 h0 c1 d1 g1 h1 */\
> +                                 /* b= c2 d2 g2 h2 c3 d3 g3 h3 */\
> +    SBUTTERFLY(a, c, d, wd, q)   /* a= a0 b0 c0 d0 e0 f0 g0 h0 */\
> +                                 /* d= a1 b1 c1 d1 e1 f1 g1 h1 */\
> +    SBUTTERFLY(e, b, c, wd, q)   /* e= a2 b2 c2 d2 e2 f2 g2 h2 */\
> +                                 /* c= a3 b3 c3 d3 e3 f3 g3 h3 */

I don't know if it would be faster, but punpcklbw could read from memory,
making separate movq unneeded
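
To make the data flow of TRANSPOSE8x4 easier to follow, here is a scalar C
model of it. It is a sketch, not the patch's code: mm_reg and the helper
functions are hypothetical stand-ins for MMX registers and instructions, and
SBUTTERFLY(a,b,t,n,m) is assumed to copy a into t, unpack the low halves into
a and the high halves into t, as the comments in the macro suggest.

```c
#include <stdint.h>
#include <string.h>

/* Model of one 64-bit MMX register; b[0] is the lowest byte. */
typedef struct { uint8_t b[8]; } mm_reg;

/* punpck{l,h}bw src, dst: interleave bytes of one half of dst and src. */
static mm_reg unpack_bw(mm_reg dst, mm_reg src, int high)
{
    mm_reg r;
    for (int i = 0; i < 4; i++) {
        r.b[2 * i]     = dst.b[4 * high + i];
        r.b[2 * i + 1] = src.b[4 * high + i];
    }
    return r;
}

/* punpck{l,h}wd src, dst: interleave 16-bit words of one half. */
static mm_reg unpack_wd(mm_reg dst, mm_reg src, int high)
{
    mm_reg r;
    for (int i = 0; i < 2; i++) {
        memcpy(&r.b[4 * i],     &dst.b[4 * high + 2 * i], 2);
        memcpy(&r.b[4 * i + 2], &src.b[4 * high + 2 * i], 2);
    }
    return r;
}

/* Eight rows of four bytes in, four rows of eight bytes out. */
static void transpose8x4(uint8_t rows[8][4], uint8_t cols[4][8])
{
    mm_reg r[8], t;

    for (int i = 0; i < 8; i++) {          /* 32-bit load of each row */
        memset(&r[i], 0, sizeof(r[i]));
        memcpy(r[i].b, rows[i], 4);
    }
    mm_reg a = r[0], b = r[1], c = r[2], d = r[3];

    a = unpack_bw(a, r[4], 0);             /* a0 e0 a1 e1 a2 e2 a3 e3 */
    b = unpack_bw(b, r[5], 0);             /* b0 f0 b1 f1 b2 f2 b3 f3 */
    c = unpack_bw(c, r[6], 0);             /* c0 g0 c1 g1 c2 g2 c3 g3 */
    d = unpack_bw(d, r[7], 0);             /* d0 h0 d1 h1 d2 h2 d3 h3 */

    mm_reg e = unpack_bw(a, b, 1);         /* SBUTTERFLY(a, b, e, bw, q) */
    a = unpack_bw(a, b, 0);
    b = unpack_bw(c, d, 1);                /* SBUTTERFLY(c, d, b, bw, q) */
    c = unpack_bw(c, d, 0);
    d = unpack_wd(a, c, 1);                /* SBUTTERFLY(a, c, d, wd, q) */
    a = unpack_wd(a, c, 0);
    t = unpack_wd(e, b, 1);                /* SBUTTERFLY(e, b, c, wd, q) */
    e = unpack_wd(e, b, 0);
    c = t;

    memcpy(cols[0], a.b, 8);               /* a0 b0 c0 d0 e0 f0 g0 h0 */
    memcpy(cols[1], d.b, 8);               /* a1 b1 c1 d1 e1 f1 g1 h1 */
    memcpy(cols[2], e.b, 8);               /* a2 b2 c2 d2 e2 f2 g2 h2 */
    memcpy(cols[3], c.b, 8);               /* a3 b3 c3 d3 e3 f3 g3 h3 */
}
```

The first four unpack_bw calls are where a memory operand could replace a
separate load: the MMX form of punpcklbw reads only 32 bits from memory, so
rows e..h could be fed to it directly.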


[...]
> +void ff_vp3_v_loop_filter_mmx(uint8_t *src, int stride, int *bounding_values)
> +{

> +    if (bounding_values[129] > 63*0x02020202) {
> +        ff_vp3_v_loop_filter_c(src, stride, bounding_values);
> +        return;
> +    }

it would be faster to not do this in the inner loop, though it would be
less clean ...
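
One way the check could be moved out of the inner loop is to pick the filter
once per frame and call it through a function pointer. The sketch below only
illustrates that idea: the stub filters, select_loop_filter() and
filter_all_edges() are hypothetical names, and comparing the packed limit as
unsigned is a choice made here so that values with the top bit set still
count as "too large".

```c
#include <stdint.h>

typedef void (*loop_filter_fn)(uint8_t *src, int stride, int *bounding_values);

/* Hypothetical stand-ins for the MMX and C filters; they only record
 * which variant ran (1 = SIMD stub, 2 = C stub). */
static int last_called;

static void loop_filter_simd_stub(uint8_t *src, int stride, int *bv)
{
    (void)src; (void)stride; (void)bv;
    last_called = 1;
}

static void loop_filter_c_stub(uint8_t *src, int stride, int *bv)
{
    (void)src; (void)stride; (void)bv;
    last_called = 2;
}

/* Decide once which filter to use, based on the packed limit that the
 * patch reads from bounding_values[129]. */
static loop_filter_fn select_loop_filter(const int *bounding_values)
{
    uint32_t packed = (uint32_t)bounding_values[129];

    return packed > 63u * 0x02020202u ? loop_filter_c_stub
                                      : loop_filter_simd_stub;
}

/* Hypothetical caller: the selection happens outside the per-edge loop. */
static void filter_all_edges(uint8_t *plane, int stride, int n_edges,
                             int *bounding_values)
{
    loop_filter_fn filter = select_loop_filter(bounding_values);

    for (int i = 0; i < n_edges; i++)
        filter(plane + 8 * i, stride, bounding_values);
}
```

The cost is that the fallback decision now lives in the caller rather than
next to the code that needs it, which is the loss of cleanliness mentioned
above.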

Except for these, I am fine with the patch.

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

When you are offended at any man's fault, turn to yourself and study your
own failings. Then you will forget your anger. -- Epictetus