[FFmpeg-devel] [PATCH] flac/x86: add ff_flac_lpc_32_sse4()
lorenm at u.washington.edu
Sat Feb 1 13:24:28 CET 2014
On Sat, 1 Feb 2014, James Almer wrote:
> On 01/02/14 1:38 AM, James Almer wrote:
> > x64
> > 1261661 decicycles in flac_lpc_32_c, 32768 runs
> > 1045689 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> > 1431506 decicycles in flac_lpc_32_c, 32768 runs
> > 1209322 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> > x86
> > 1429597 decicycles in flac_lpc_32_c, 32768 runs
> > 953667 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> > 1610348 decicycles in flac_lpc_32_c, 32768 runs
> > 1079424 decicycles in ff_flac_lpc_32_sse4, 32768 runs
> > About 100 to 500 ms faster decoding using -threads 1 depending on song and arch.
> > Tested using a few 24-bit samples on an AMD FX 6300, Win7 x64 and x86.
> > Biggest speedup appears to be on x86 builds.
> > Signed-off-by: James Almer <jamrial at gmail.com>
> > ---
> > libavcodec/flacdsp.c | 2 ++
> > libavcodec/flacdsp.h | 1 +
> > libavcodec/x86/Makefile | 2 ++
> > libavcodec/x86/flacdsp.asm | 61 +++++++++++++++++++++++++++++++++++++++++++
> > libavcodec/x86/flacdsp_init.c | 39 +++++++++++++++++++++++++++
> > 5 files changed, 105 insertions(+)
> > create mode 100644 libavcodec/x86/flacdsp.asm
> > create mode 100644 libavcodec/x86/flacdsp_init.c
> Couldn't test with Valgrind, or on a Linux box for that matter.
> I have access to this FX 6300 for the time being so I used it to write this, but can't
> install a VM.
> I originally wrote this doing two calculations per packed instruction (using all 128
> bits on the xmm registers instead of 64), but after punpckldq-ing and pshufd-ing values
> around and adding extra checks for odd pred_order values it somehow ended up slower
> than the pure C implementation.
> This will do until I get that other version working faster. If I can, of course.
Did you try applying the optimization from flac_lpc_16_c to flac_lpc_32_c?
A SIMD implementation shouldn't need any shuffles: just leave the samples
in their natural order in the xmm registers and let a single pmuldq apply to
nonadjacent samples. You also shouldn't need any check on the parity of
pred_order if you zero-pad coefs.
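
A rough scalar C sketch of the zero-padding point, not code from the patch.
The names lpc32_ref, lpc32_padded and MAX_ORDER are made up for illustration,
and the reference loop is only assumed to have the same shape as
flac_lpc_32_c. The two accumulators stand in for the two 64-bit products of
one pmuldq; the sketch does not model the actual register layout, where
pmuldq multiplies dwords 0 and 2 of each operand.

```c
#include <stdint.h>
#include <string.h>

#define MAX_ORDER 32 /* hypothetical upper bound on pred_order */

/* Reference: one tap per inner iteration (assumed shape of the C version). */
static void lpc32_ref(int32_t *decoded, const int32_t *coeffs,
                      int pred_order, int qlevel, int len)
{
    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t sum = 0;
        for (int j = 0; j < pred_order; j++)
            sum += (int64_t)coeffs[j] * decoded[j];
        /* assumes arithmetic right shift of negative values */
        decoded[pred_order] += (int32_t)(sum >> qlevel);
    }
}

/* Two taps per inner iteration, with coefs zero-padded to an even count.
 * No branch on the parity of pred_order is needed: for odd orders the
 * extra tap multiplies decoded[pred_order] (the sample about to be
 * rewritten, always inside the buffer) by zero. */
static void lpc32_padded(int32_t *decoded, const int32_t *coeffs,
                         int pred_order, int qlevel, int len)
{
    int32_t padded[MAX_ORDER] = { 0 };
    int order2 = (pred_order + 1) & ~1; /* round up to a multiple of 2 */

    memcpy(padded, coeffs, (size_t)pred_order * sizeof(*padded));

    for (int i = pred_order; i < len; i++, decoded++) {
        int64_t lane0 = 0, lane1 = 0; /* stand-ins for two 64-bit lanes */
        for (int j = 0; j < order2; j += 2) {
            lane0 += (int64_t)padded[j]     * decoded[j];
            lane1 += (int64_t)padded[j + 1] * decoded[j + 1];
        }
        decoded[pred_order] += (int32_t)((lane0 + lane1) >> qlevel);
    }
}
```

Running both functions on copies of the same buffer should give identical
output for odd and even pred_order alike. Padding to a multiple of 4 for a
full 128-bit load would also need the sample buffer to tolerate reads past
the current sample, which this sketch avoids.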