[FFmpeg-devel] [PATCH 3/3] Use DSPContext.vector_fmul() and DSPContext.vector_fmul_reverse() in floating-point version of apply_window(). 46% faster in function apply_window().
Tue Jan 18 16:52:04 CET 2011
On 01/18/2011 10:42 AM, Michael Niedermayer wrote:
> On Wed, Jan 05, 2011 at 04:32:40PM -0500, Justin Ruggles wrote:
>> On 01/05/2011 04:06 PM, Loren Merritt wrote:
>>> On Tue, 4 Jan 2011, Justin Ruggles wrote:
>>>> Currently we have vector_fmul() for: C, neon, vfp, altivec, 3dnow, sse
>>>> I implemented vector_fmul_copy() for C, altivec, 3dnow, and sse to use 2
>>>> src and 1 dst. The Altivec version of vector_fmul_copy() has not been
>>>> tested, but I implemented it in the hope that someone else will test and
>>>> review it. Here are some benchmarks on my Athlon64. Benchmark numbers
>>>> are in dezicycles.
>>>> I also tried to rewrite the current C version in SSE. It was faster
>>>> than the fmul_copy+fmul_reverse since it basically merges the 2 loops,
>>>> but it was slower than vector_fmul_copy(512). I left that out of the
>>>> patch. If anyone is interested I can send it...
>>> I predict that all of the vector_fmul_* mentioned here are memory-bound on
>>> intel and arithmetic-bound on amd.
>>> Is there any reason to keep both the 2-arg and 3-arg version of
>> I tested using vector_fmul_copy with same value for src0 and dst and it
>> ended up being slower. I thought it was weird, so I kept both versions.
>> Maybe I did something wrong in my tests though...
>> Also, I'll try benchmarking these on my laptop (Intel Atom 330, 64-bit
> Is there a patch I should review left in this thread, or should I be waiting
> for a new one?
New patch attached. I did more testing, and changing the existing
vector_fmul() works fine.
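
For readers following the thread without the attachment, here is a minimal
plain-C sketch of the idea in the subject line: a windowing step expressed as
one forward multiply plus one reversed multiply. The sketch_* names are
hypothetical and are not the DSPContext code from the patch. It assumes a
3-argument multiply (2 src, 1 dst, as discussed above) and a symmetric window
of which only the first half is stored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routines for illustration only; these are not
 * FFmpeg's DSPContext implementations. */

/* 3-argument multiply: dst[i] = src0[i] * src1[i].
 * Passing dst == src0 gives the in-place (2-argument) behaviour. */
static void sketch_vector_fmul(float *dst, const float *src0,
                               const float *src1, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src0[i] * src1[i];
}

/* Reversed multiply: dst[i] = src0[i] * src1[len - 1 - i]. */
static void sketch_vector_fmul_reverse(float *dst, const float *src0,
                                       const float *src1, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src0[i] * src1[len - 1 - i];
}

/* Assumption: the window is symmetric and win_half holds its first `half`
 * coefficients, so the second half of the input is multiplied by the same
 * coefficients read back to front. Processes 2 * half samples. */
static void sketch_apply_window(float *out, const float *in,
                                const float *win_half, size_t half)
{
    assert(half > 0);
    sketch_vector_fmul(out, in, win_half, half);
    sketch_vector_fmul_reverse(out + half, in + half, win_half, half);
}
```

Splitting the work this way lets each half be handled by an existing
vectorized routine instead of a single scalar loop, which is where a speedup
would be expected to come from.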