[FFmpeg-devel] [RFC/PATCH] More flexible variant of float_to_int16, WMA optimization, Vorbis

Michael Niedermayer michaelni
Mon Jul 7 23:44:58 CEST 2008


On Mon, Jul 07, 2008 at 10:39:32AM +0300, Siarhei Siamashka wrote:
> Hi,
> 
> Here is a patch which adds a bit more flexible variant of 'float_to_int16'
> function ('more_flexible_variant_of_float_to_int16.diff').
> 
> It can be used for quite a noticeable WMA decoding performance improvement
> ('float_to_int16_wma.diff'), which is at least ~15% in my tests. Using current
> 'float_to_int16' is hard for WMA without introducing unnecessary intermediate
> operations involving interleaving samples in temporary buffer.

Maybe, but this doesnt mean we should not benchmark that intermediate
operation + current float_to_int16().
I really do not care how much faster your new asm code is compared to ISO C.
What i am interrested in is how it does compare to the current asm code.


> 
> Currently 'dca.c', 'ac3dec.c' use extra code for interleaving samples and can 
> be optimized.
> 
> Also 'float_to_int16_vorbis.diff' contains a patch which moves channel
> interleaving logic from 'vector_*' function to 'float_to_int16_*'. It
> simplifies the logic in 'vorbis_parse_audio_packet' and creates 
> opportunities for further optimizations. Also it makes vorbis decoding
> a bit faster (something like ~1.5% in my tests on Pentium-M) because of 

Ive just optimized float_to_int16 a little (it matches your proposed code
in terms of optimizations now)
So i think you should redo the benchmarks


[...]
> Regarding the subject, does it make sense to completely replace current 
> 'float_to_int16' function and use a new one instead? Using new function
> instead of old one is simple (though a bit cumbersome because it would 
> require creating a temporary array with a single entry, holding a pointer to
> samples). And using old function is problematic for at least WMA, DCA, AC3.
> 
> Problems to solve are efficient handling of non 1 or 2 channels case. It needs
> to be investigated if a generic variant can be optimized well (at least it
> should be faster than manual interleaving of samples) and what other special
> number of channels cases should be handled. Also I can do ARM VFP 
> optimizations, but 3NOW and Altivec versions would be needed.

It has been discussed already a few times ....
our decoders should output their native format, that may be planar floats
for several of these codecs
And a converter (containing your proposed float_to_int16) could
then convert and interleave this when the user app / audio out hardware
can not handle planar audio.
Are you interrested to work on such converter?


[...]
> Index: libavcodec/i386/dsputil_mmx.c
> ===================================================================
> --- libavcodec/i386/dsputil_mmx.c	(revision 14080)
> +++ libavcodec/i386/dsputil_mmx.c	(working copy)
> @@ -2052,6 +2052,43 @@
>      asm volatile("emms");
>  }
>  
> +static void float_to_int16_join_channels_sse(int16_t *dst, const float *src[], int len, int ch){
> +    int i, k;
> +    if (ch == 1) {
> +        float_to_int16_sse(dst, src[0], len);
> +    } else if (ch == 2 && !(len & 3)) {
> +        x86_reg idx = -(len*4);
> +        asm volatile(
> +            "1: \n\t"
> +            "cvtps2pi    0(%2, %0), %%mm0    \n\t"
> +            "cvtps2pi    0(%3, %0), %%mm1    \n\t"
> +            "cvtps2pi    8(%2, %0), %%mm2    \n\t"
> +            "cvtps2pi    8(%3, %0), %%mm3    \n\t"
> +            "add         $16, %0             \n\t"
> +            "packssdw    %%mm1, %%mm0        \n\t"
> +            "packssdw    %%mm3, %%mm2        \n\t"
> +            "pshufw      $0xD8, %%mm0, %%mm0 \n\t"
> +            "pshufw      $0xD8, %%mm2, %%mm2 \n\t"
> +            "movq        %%mm0, -16(%1, %0)  \n\t"
> +            "movq        %%mm2, -8(%1, %0)   \n\t"
> +            "js          1b                  \n\t"
> +            "emms                            \n\t"
> +            : "+r" (idx)
> +            : "r" (dst + len*2), "r"(src[0] + len), "r"(src[1] + len)
> +            : "cc", "memory"
> +        );
> +    } else {
> +        for(k = 0; k < ch; k++) {
> +            int16_t *ptr = dst + k;
> +            const float *iptr = src[k];
> +            for(i = 0; i < len; i++) {
> +                *ptr = av_clip_int16(lrintf(*iptr++));
> +                ptr += ch;
> +            }
> +        }
> +    }
> +}

I think the code in else will perform very poor speed wise on any
modern cpu which fill cache lines on write misses


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

I have often repented speaking, but never of holding my tongue.
-- Xenocrates
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080707/23646d1e/attachment.pgp>



More information about the ffmpeg-devel mailing list