[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations

Wed Sep 10 10:00:24 CEST 2008

On Tuesday 09 September 2008, Loren Merritt wrote:
> On Tue, 9 Sep 2008, Siarhei Siamashka wrote:
> > On Wednesday 03 September 2008, Michael Niedermayer wrote:
> >> This could be added as a SHOW_CONST_UBITS
> >> also gcc should be able to build the mask itself at compile time as long
> >> as no asm shift tricks are used.
> >
> > Sure. The only problem is that it would be nice to use the same macro for
> > both constant and non-constant expressions. Adding one more macro does
> > not add much convenience because the compiler can't automatically choose
> > between inserting a constant and using the asm shift trick. Or can it?
>
> __builtin_constant_p

Thanks, this is interesting.
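
To check my understanding, here is a minimal sketch of how a single macro
could pick between the two paths with __builtin_constant_p (a GCC builtin).
The names SHOW_UBITS_ANY, show_ubits_const and show_ubits_var are made up
for illustration and are not from the patch, and the "asm shift trick" is
only imitated in plain C here:

```c
#include <assert.h>
#include <stdint.h>

/* Constant-count path: with a literal n the compiler can fold (32 - n). */
static inline uint32_t show_ubits_const(uint32_t cache, int n)
{
    assert(n >= 1 && n <= 32);
    return cache >> (32 - n);
}

/* Variable-count path: hypothetical stand-in for an asm shift trick.
 * It imitates a shift whose count is taken modulo 32, so shifting by
 * the negated count equals shifting by (32 - n) for n in 1..32. */
static inline uint32_t show_ubits_var(uint32_t cache, int n)
{
    assert(n >= 1 && n <= 32);
    return cache >> ((unsigned)(-n) & 31u);
}

/* One macro for both cases: the choice is made at compile time. */
#if defined(__GNUC__)
#define SHOW_UBITS_ANY(cache, n)                        \
    (__builtin_constant_p(n) ? show_ubits_const(cache, n) \
                             : show_ubits_var(cache, n))
#else
#define SHOW_UBITS_ANY(cache, n) show_ubits_var(cache, n)
#endif
```

Both paths return the top n bits of the cache, so callers would not need to
know which one was selected.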

> > Some basic SSE optimizations are added, most likely they still can be
> > improved.
>
> You could try decoding residual in channel-interleaved order, so that
> consecutive codebook entries are consecutive in decoded memory. The SIMD
> savings might be worth an extra copy to deinterleave afterward.

Do you suggest deinterleaving the codebook entries beforehand, at the header
setup stage, so that when they are used in the residue decode function later,
no extra 'shufps' SSE instructions would be needed? This might actually work.
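
For reference, the interleaved-order variant with an extra copy afterward
could look roughly like the sketch below. This is plain C with made-up helper
names (add_codebook_vector, deinterleave), not code from the patch; it only
shows that each codebook vector lands in contiguous memory before the final
per-channel split:

```c
#include <assert.h>
#include <stddef.h>

/* Add one codebook vector of 'dim' floats to a contiguous run of the
 * channel-interleaved residue buffer (this loop is the SIMD candidate). */
static void add_codebook_vector(float *interleaved, const float *entry,
                                size_t dim)
{
    for (size_t i = 0; i < dim; i++)
        interleaved[i] += entry[i];
}

/* The extra copy: split [c0 c1 c0 c1 ...] into one array per channel. */
static void deinterleave(float *const *out, const float *interleaved,
                         size_t channels, size_t samples)
{
    assert(channels > 0);
    for (size_t s = 0; s < samples; s++)
        for (size_t c = 0; c < channels; c++)
            out[c][s] = interleaved[s * channels + c];
}
```

With two channels and dim 4, each call to add_codebook_vector touches four
adjacent floats, and deinterleave is then run once per block.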

> Better yet, but more complex, would be to decode residual in
> channel-interleaved order and not deinterleave at all. That would reduce
> the number of shuffles in mdct/fft (for 2 or 4 channels), but would require
> new fft asm.

The new fft sounds too complex at this stage, maybe this idea can be
reevaluated a bit later.

Also it might be possible to decode residual in interleaved order, manage
inverse coupling somehow and perform residual deinterleaving in 'vector_fmul'
at the dotproduct stage. I tried to get some statistics regarding the number
of addition operations in residual decoding vs. multiplication operations
in the dotproduct stage. They are generally of the same order of magnitude,
with residual decoding having more operations on high bitrate files and fewer
operations on low bitrate files, unless I messed up the measurements. For
example:

64 kbit file:

residue op count=7345968
dotproduct op count=11029504

256 kbit file:

residue op count=13073952
dotproduct op count=11029760
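
A rough sketch of what folding the deinterleave into the dotproduct stage
might look like, assuming that stage is an elementwise multiply of the floor
curve by the residue. The function fmul_deinterleave and its parameters are
hypothetical, this is not the existing 'vector_fmul', and inverse coupling is
left out entirely:

```c
#include <assert.h>
#include <stddef.h>

/* Multiply the floor curve of channel 'ch' by that channel's residue,
 * reading the residue with a stride from a channel-interleaved buffer.
 * The strided read replaces a separate deinterleave pass. */
static void fmul_deinterleave(float *dst, const float *interleaved,
                              const float *floor_curve,
                              size_t channels, size_t ch, size_t len)
{
    assert(ch < channels);
    for (size_t i = 0; i < len; i++)
        dst[i] = interleaved[i * channels + ch] * floor_curve[i];
}
```

Whether the strided loads cost less than the separate copy would need
measuring.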

In any case, my first target was the bitstream reader and VLC decoding
optimizations. Maybe this can be cleaned up (and committed) first, with SIMD
optimizations coming a bit later? I only added SSE code to show that this
whole thing is quite SIMD friendly.

Also, SSE inline assembly intermixed with C code assumes that the C compiler
does not use XMM registers itself. This happens to be true at the moment, but
is not completely reliable, especially on x86-64. Maybe intrinsics are not a
completely bad idea for this particular case?
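
As an illustration of the intrinsics route, here is a small sketch of adding
a codebook vector to the residue with SSE intrinsics from <xmmintrin.h>. The
function name add_vec4 is made up and this is not taken from the patch; with
intrinsics the compiler does the XMM register allocation itself, so the
assumption above is not needed. A scalar fallback is used when SSE is not
available:

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* dst[i] += src[i] for n floats; n must be a multiple of 4. */
static void add_vec4(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4) {
        /* Unaligned loads/stores, so no alignment is assumed here. */
        __m128 d = _mm_loadu_ps(dst + i);
        __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(d, s));
    }
#else
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
#endif
}
```

If the buffers are known to be 16-byte aligned, the aligned load/store
intrinsics (_mm_load_ps, _mm_store_ps) could be used instead.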

By the way, I also managed to mess up the patch a bit and kept some
redundant UPDATE_CACHE lines for the dim4 and dim8 cases. A fixed version is
attached; it is faster for low bitrate files now.

-- 
Best regards,
Siarhei Siamashka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vorbis_decode_residue_opt_try2_fixed.diff
Type: text/x-diff
Size: 27262 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080910/e191e3ff/attachment.diff>