[FFmpeg-devel] [PATCH] VP3 DC-only IDCT

Sat Mar 13 14:50:51 CET 2010

David Conrad <lessen42 at gmail.com> writes:

> Hi,
>
> This gives 2-4% faster overall decode for normal files.
>
> Some thoughts:
> I can't think of any shortcuts that could make the IDCT faster with 128-bit SIMD that don't rely on knowing the last non-zero coefficient.
>
> Knowing that before calling the IDCT, you could do a slightly faster IDCT that assumes the right and bottom of the block are all 0. This seems to be significantly faster only for MMX; for SSE2 it's nearly a wash between the cost of the added check and the time saved.
>
> For an average video, around a third of all IDCTs are DC-only, another third could be done with that shortcut (i.e. last_nnz is under 10), and the rest require a full IDCT.
>
> libtheora only does the 10-element shortcut, not DC-only. It also only has an MMX IDCT.
>
> I also haven't really looked at whether a DC-only IDCT is beneficial for MPEG codecs, hence the VP3-specific dsputil function.
>
>
> commit 0c4da1f09d90f7aec230b190195e063d51a2f3d8
> Author: David Conrad <lessen42 at gmail.com>
> Date:   Sat Mar 13 01:13:57 2010 -0500
>
>     vp3: DC-only IDCT
>
>     2-4% faster overall decode
>
> diff --git a/libavcodec/arm/dsputil_init_neon.c b/libavcodec/arm/dsputil_init_neon.c
> index 4a8de5e..9644748 100644
> --- a/libavcodec/arm/dsputil_init_neon.c
> +++ b/libavcodec/arm/dsputil_init_neon.c
> @@ -32,6 +32,7 @@ void ff_simple_idct_add_neon(uint8_t *dest, int line_size, DCTELEM *data);
>  void ff_vp3_idct_neon(DCTELEM *data);
>  void ff_vp3_idct_put_neon(uint8_t *dest, int line_size, DCTELEM *data);
>  void ff_vp3_idct_add_neon(uint8_t *dest, int line_size, DCTELEM *data);
> +void ff_vp3_idct_dc_add_neon(uint8_t *dest, int line_size, DCTELEM *data);
>
>  void ff_put_pixels16_neon(uint8_t *, const uint8_t *, int, int);
>  void ff_put_pixels16_x2_neon(uint8_t *, const uint8_t *, int, int);
> @@ -386,6 +387,7 @@ void ff_dsputil_init_neon(DSPContext *c, AVCodecContext *avctx)
>      if (CONFIG_VP3_DECODER) {
>          c->vp3_v_loop_filter = ff_vp3_v_loop_filter_neon;
>          c->vp3_h_loop_filter = ff_vp3_h_loop_filter_neon;
> +        c->vp3_idct_dc_add   = ff_vp3_idct_dc_add_neon;
>      }
>
>      c->vector_fmul                = ff_vector_fmul_neon;
> diff --git a/libavcodec/arm/vp3dsp_neon.S b/libavcodec/arm/vp3dsp_neon.S
> index 6deae47..ade1998 100644
> --- a/libavcodec/arm/vp3dsp_neon.S
> +++ b/libavcodec/arm/vp3dsp_neon.S
> @@ -374,3 +374,47 @@ function ff_vp3_idct_add_neon, export=1
>      vst1.64         {d7}, [r2,:64], r1
>      bx              lr
>  endfunc
> +
> +function ff_vp3_idct_dc_add_neon, export=1
> +    ldrsh           r2,  [r2]
> +    movw            r3,  #46341
> +    mul             r2,  r3,  r2
> +    smulwt          r2,  r3,  r2
> +    mov             r3,  r0
> +    vdup.16         q15, r2
> +    vrshr.s16       q15, q15, #4
> +
> +    vld1.8          {d0}, [r0,:64], r1
> +    vld1.8          {d1}, [r0,:64], r1
> +    vld1.8          {d2}, [r0,:64], r1
> +    vaddw.u8        q8,  q15, d0
> +    vld1.8          {d3}, [r0,:64], r1
> +    vaddw.u8        q9,  q15, d1
> +    vld1.8          {d4}, [r0,:64], r1
> +    vaddw.u8        q10, q15, d2
> +    vld1.8          {d5}, [r0,:64], r1
> +    vaddw.u8        q11, q15, d3
> +    vld1.8          {d6}, [r0,:64], r1
> +    vaddw.u8        q12, q15, d4
> +    vld1.8          {d7}, [r0,:64], r1
> +    vaddw.u8        q13, q15, d5
> +    vqmovun.s16     d0,  q8
> +    vaddw.u8        q14, q15, d6
> +    vqmovun.s16     d1,  q9
> +    vaddw.u8        q15, q15, d7
> +    vqmovun.s16     d2,  q10
> +    vst1.8          {d0}, [r3,:64], r1
> +    vqmovun.s16     d3,  q11
> +    vst1.8          {d1}, [r3,:64], r1
> +    vqmovun.s16     d4,  q12
> +    vst1.8          {d2}, [r3,:64], r1
> +    vqmovun.s16     d5,  q13
> +    vst1.8          {d3}, [r3,:64], r1
> +    vqmovun.s16     d6,  q14
> +    vst1.8          {d4}, [r3,:64], r1
> +    vqmovun.s16     d7,  q15
> +    vst1.8          {d5}, [r3,:64], r1
> +    vst1.8          {d6}, [r3,:64], r1
> +    vst1.8          {d7}, [r3,:64], r1
> +    bx              lr
> +endfunc

Looks good, assuming it works.
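
For reference, the scalar arithmetic that the NEON routine appears to
perform can be sketched in C as below. This is only a reading of the
assembly above, not the patch's C fallback: the names clip_uint8,
vp3_dc_value and vp3_idct_dc_add_sketch are hypothetical, int16_t stands
in for DCTELEM, and right shifts of negative values are assumed to be
arithmetic (as they are on ARM with gcc).

```c
#include <stdint.h>

/* Hypothetical helper: clamp to 0..255, the effect of vqmovun.s16. */
static uint8_t clip_uint8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Scale the DC coefficient twice by 46341/65536 (about 1/sqrt(2)),
 * once per 1-D pass, then apply a rounding shift by 4 (vrshr.s16 #4).
 * Assumes >> on a negative int is an arithmetic shift. */
static int vp3_dc_value(int16_t coeff)
{
    int dc = (46341 * coeff) >> 16;          /* mul, then top half via smulwt */
    return (46341 * dc + (8 << 16)) >> 20;   /* smulwt >> 16, then (x + 8) >> 4 */
}

/* Add the same DC value to every pixel of an 8x8 block, saturating. */
static void vp3_idct_dc_add_sketch(uint8_t *dest, int line_size,
                                   const int16_t *block)
{
    int dc = vp3_dc_value(block[0]);

    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            dest[x] = clip_uint8(dest[x] + dc);
        dest += line_size;
    }
}
```

With a coefficient of 64 this yields a per-pixel offset of 2, so the two
multiplies and the final shift together amount to roughly a division by
32 with rounding.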

-- 
Måns Rullgård
mans at mansr.com