[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations

Sun Sep 28 02:33:10 CEST 2008

On Thursday 11 September 2008, Loren Merritt wrote:
> On Wed, 10 Sep 2008, Siarhei Siamashka wrote:
> > On Tuesday 09 September 2008, Loren Merritt wrote:
> >> You could try decoding residual in channel-interleaved order, so that
> >> consecutive codebook entries are consecutive in decoded memory. The simd
> >> savings might be worth an extra copy to deinterleave afterward.
> >
> > Do you suggest deinterleaving codebook entries beforehand, at the header
> > setup stage, so that when they are used in the residue decode function
> > later, no extra 'shufps' SSE instructions would be needed? This might
> > actually work.
>
> Better than nothing, though it doesn't help dim2.

Yes, sure. All these more complicated optimizations are interesting and can be
considered later.
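
For reference, the "extra copy to deinterleave afterward" from Loren's
suggestion could look roughly like the sketch below. It is not part of the
patch: the function name and the plain stereo layout (L0 R0 L1 R1 ...) are
assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the patch): split n interleaved stereo
 * frames (L0 R0 L1 R1 ...) into two separate per-channel buffers. */
static void deinterleave_stereo(const float *src, float *left,
                                float *right, size_t n)
{
    assert(src && left && right);
    for (size_t i = 0; i < n; i++) {
        left[i]  = src[2 * i];      /* even slots belong to channel 0 */
        right[i] = src[2 * i + 1];  /* odd slots belong to channel 1  */
    }
}
```

The open question is whether this additional pass over the data costs less
than the shuffles it removes from the decode loop, which only a benchmark can
answer.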

> >+ while ((step -= 4) >= 0) {
> >+     UPDATE_CACHE(re, gb)
> >+     VORBIS_GET_VLC(coffs, re, gb, vlc_table, codebook_nb_bits, codebook_nb_bits_mask, 3, 1)
> >+     asm volatile ("movlps 0(%0,%1,8), %%xmm0 \n" : : "r" (codevectors), "r" (coffs));
> >+     VORBIS_GET_VLC(coffs, re, gb, vlc_table, codebook_nb_bits, codebook_nb_bits_mask, 3, 0)
> >+     asm volatile ("movhps (%0,%1,8), %%xmm0 \n" : : "r" (codevectors), "r" (coffs));
> >+     UPDATE_CACHE(re, gb)
> >+     VORBIS_GET_VLC(coffs, re, gb, vlc_table, codebook_nb_bits, codebook_nb_bits_mask, 3, 1)
> >+     asm volatile ("movlps 0(%0,%1,8), %%xmm1 \n" : : "r" (codevectors), "r" (coffs));
> >+     VORBIS_GET_VLC(coffs, re, gb, vlc_table, codebook_nb_bits, codebook_nb_bits_mask, 3, 0)
> >+     asm volatile ("movhps (%0,%1,8), %%xmm1 \n" : : "r" (codevectors), "r" (coffs));
> >+     asm volatile ("movaps %xmm0, %xmm3 \n");
> >+     asm volatile ("shufps $0x88, %xmm1, %xmm0 \n");
> >+     asm volatile ("shufps $0xDD, %xmm1, %xmm3 \n");
> >+     asm volatile ("movaps 0(%0), %%xmm4 \n" : : "r" (p1));
> >+     asm volatile ("movaps 0(%0), %%xmm5 \n" : : "r" (p2));
> >+     asm volatile ("addps %xmm0, %xmm4 \n");
> >+     asm volatile ("addps %xmm3, %xmm5 \n");
>
> asm volatile ("addps (%0), %%xmm0 \n" : : "r" (p1));
> asm volatile ("addps (%0), %%xmm3 \n" : : "r" (p2));

Thanks, a good catch.
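
To make the data flow of this loop body easier to follow, here is a scalar C
model of it. It is only an illustration: 'shufps_model' and 'add_dim2_ch2'
are made-up names, and the code mimics the semantics of the instructions
rather than reproducing the patch. Two registers hold four dimension-2
codevectors {a0,a1,b0,b1} and {c0,c1,d0,d1}; 'shufps $0x88' gathers the even
elements for the first channel and 'shufps $0xDD' the odd elements for the
second one.

```c
#include <assert.h>

/* Scalar model of SSE "shufps $imm, src, dst" (AT&T operand order):
 * the two low result lanes are selected from dst, the two high lanes
 * from src, two selector bits per lane. dst and src may alias. */
static void shufps_model(float dst[4], const float src[4], unsigned imm)
{
    float r[4];
    assert(imm <= 0xFF);
    r[0] = dst[imm & 3];
    r[1] = dst[(imm >> 2) & 3];
    r[2] = src[(imm >> 4) & 3];
    r[3] = src[(imm >> 6) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}

/* Hypothetical helper modelling one loop iteration for 2 channels and
 * dimension-2 codevectors: v01 = {a0,a1,b0,b1}, v23 = {c0,c1,d0,d1}. */
static void add_dim2_ch2(float p1[4], float p2[4],
                         const float v01[4], const float v23[4])
{
    float even[4], odd[4];
    for (int i = 0; i < 4; i++)
        even[i] = odd[i] = v01[i];      /* the movaps register copy  */
    shufps_model(even, v23, 0x88);      /* even = {a0,b0,c0,d0}      */
    shufps_model(odd,  v23, 0xDD);      /* odd  = {a1,b1,c1,d1}      */
    for (int i = 0; i < 4; i++) {
        p1[i] += even[i];               /* the addps for channel 0   */
        p2[i] += odd[i];                /* the addps for channel 1   */
    }
}
```

Using a memory operand for 'addps', as Loren suggests, produces the same sums
while dropping the two separate 'movaps' loads. A non-VEX 'addps' memory
operand has the same 16-byte alignment requirement as 'movaps', so this adds
no new constraint on p1 and p2.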

> >+ if (step & 2) {
> >+     UPDATE_CACHE(re, gb)
> >+     VORBIS_GET_VLC(coffs, re, gb, vlc_table, codebook_nb_bits, codebook_nb_bits_mask, 3, 1)
> >+     asm volatile ("movlps 0(%0,%1,8), %%xmm0 \n" : : "r" (codevectors), "r" (coffs));
> >+     VORBIS_GET_VLC(coffs, re, gb, vlc_table, codebook_nb_bits, codebook_nb_bits_mask, 3, 0)
> >+     asm volatile ("movhps (%0,%1,8), %%xmm0 \n" : : "r" (codevectors), "r" (coffs));
> >+     asm volatile ("shufps $0xD8, %xmm0, %xmm0 \n");
>
> unpcklps is faster than shufps

Yes, but in practice this part of the code is never used. For files encoded
by oggenc, 'step' is typically a multiple of 8; in
'vorbis_residue_decode_type2_ch2_dim2_8bit', for example, we usually get the
values 8 and 16. This also makes it possible to unroll the loop further and
have 3 UPDATE_CACHE operations per 8 VORBIS_GET_VLC operations, instead of
the current 2 per 4.
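
The arithmetic behind these counts can be sketched as follows. Two
assumptions are mine and are not spelled out above: that a refill leaves at
least 25 valid bits in the cache, and that each code in the '8bit' variant
consumes at most 8 bits. The function names are made up for the example.

```c
#include <assert.h>

/* How many reads fit between two refills, if every read may consume up
 * to max_code_bits and a refill is assumed to leave at least
 * min_cache_bits valid bits in the cache. */
static int reads_per_refill(int min_cache_bits, int max_code_bits)
{
    assert(max_code_bits > 0 && max_code_bits <= min_cache_bits);
    return min_cache_bits / max_code_bits;
}

/* Refills needed for n_reads consecutive reads (rounded up). */
static int refills_needed(int n_reads, int min_cache_bits, int max_code_bits)
{
    int per = reads_per_refill(min_cache_bits, max_code_bits);
    return (n_reads + per - 1) / per;
}
```

Under those assumptions three reads fit between refills, so
refills_needed(4, 25, 8) is 2 and refills_needed(8, 25, 8) is 3: the refill
cost per decoded codeword drops from 1/2 to 3/8 when the loop handles 8
codewords per iteration.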

So what should we do next to get this work finalized and move on to other
optimizations? Everyone has had a few weeks to comment on the patch and run
some benchmarks.

I would prefer to have the bitstream-related changes committed separately, as
they are an optimization independent of the SSE part, so that each part can
be benchmarked on its own. And I'm more interested in ARM assembly
optimizations than in x86 ones :)

-- 
Best regards,
Siarhei Siamashka