[FFmpeg-devel] [PATCH] 'vorbis_residue_decode' optimizations
Ramiro Polla
ramiro.polla
Mon Sep 1 00:57:37 CEST 2008
On Sun, Aug 31, 2008 at 7:51 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Sun, Aug 31, 2008 at 04:42:35PM +0200, Michael Niedermayer wrote:
>> On Sun, Aug 31, 2008 at 01:18:14PM +0300, Siarhei Siamashka wrote:
>> > On Sunday 31 August 2008, Michael Niedermayer wrote:
>> > > On Sat, Aug 30, 2008 at 11:42:31PM +0300, Siarhei Siamashka wrote:
>> > > > On Saturday 30 August 2008, Loren Merritt wrote:
>> > > > > On Sat, 30 Aug 2008, Siarhei Siamashka wrote:
>> > > > > > This trivial patch improves overall vorbis decoding performance by
>> > > > > > ~3% on Pentium-M with gcc 4.2.3
>> > > > >
>> > > > > vorbis_residue_decode_type# are superfluous. Just inline
>> > > > > vorbis_residue_decode_internal into vorbis_residue_decode.
>> > > >
>> > > > Theoretically they are superfluous (inlining
>> > > > vorbis_residue_decode_internal into vorbis_residue_decode was the first
>> > > > thing that I tried). But in practice code is consistently faster this
>> > > > way. Probably it is easier for gcc to optimize 3 independent functions
>> > > > than everything bundled into a huge one. Let me know if you get different
>> > > > results.
>> > >
>> > > well, I do
>> > >
>> > > [...]
>> > >
>> > > > --------------------
>> > > > callgrind simulation for './ffmpeg_g.1huge' (L1 data cache is 32K):
>> > > > I refs: 85,817,091
>> > > > D refs: 43,457,905 (28,888,575 rd + 14,569,330 wr)
>> > > > D1 misses: 785,564 ( 583,645 rd + 201,919 wr)
>> > > > D1 miss rate: 1.8% ( 2.0% + 1.3% )
>> > > > callgrind simulation for './ffmpeg_g.3func' (L1 data cache is 32K):
>> > > > I refs: 85,085,997
>> > > > D refs: 42,653,212 (28,454,961 rd + 14,198,251 wr)
>> > > > D1 misses: 782,978 ( 581,685 rd + 201,293 wr)
>> > > > D1 miss rate: 1.8% ( 2.0% + 1.4% )
>> > > >
>> > > > The difference is visible both for the total number of instructions and
>> > > > for the number of memory accesses.
>> > >
>> > > loren:
>> > > I refs: 5,663,789,738
>> > > I1 misses: 3,515,218
>> > > I1 miss rate: 0.06%
>> > > D refs: 1,889,318,408 (1,365,757,445 rd + 523,560,963 wr)
>> > > D1 misses: 32,073,499 ( 22,443,938 rd + 9,629,561 wr)
>> > > D1 miss rate: 1.6% ( 1.6% + 1.8% )
>> > >
>> > > siar:
>> > > I refs: 5,670,795,747
>> > > I1 misses: 3,488,120
>> > > I1 miss rate: 0.06%
>> > > D refs: 1,896,279,210 (1,372,731,243 rd + 523,547,967 wr)
>> > > D1 misses: 32,096,476 ( 22,464,805 rd + 9,631,671 wr)
>> > > D1 miss rate: 1.6% ( 1.6% + 1.8% )
>> >
>> > Took time to compile/install gcc 4.3.2 and also got similar results. What's
>> > more important, the fastest build generated by gcc 4.3.2 (all inlined) was
>> > better than the fastest build generated by 4.2.3 (dummy functions). This
>> > really makes the choice quite obvious :)
>> >
>> > > I'll commit the clean version without the dummy functions in a day or two
>> > > unless someone objects / has some idea of how to improve it.
>> >
>> > I also tried to benchmark the variants where 'vlen' is also inlined as
>> > constants 128 and 1024 which are quite typical (with the hope that it could
>> > save 1 extra register for gcc in the inner loop) but effect on the
>> > performance was minimal.
>> >
>> > Regarding 'vorbis_residue_decode' function, it probably makes sense to
>> > optimize these loops:
>> >
>> > if(dim==2) {
>> >     for(k=0;k<step;++k) {
>> >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 2;
>> >         vec[voffs+k     ]+=codebook.codevectors[coffs  ];  // FPMATH
>> >         vec[voffs+k+vlen]+=codebook.codevectors[coffs+1];  // FPMATH
>> >     }
>> > } else if(dim==4) {
>> >     for(k=0;k<step;++k, voffs+=2) {
>> >         coffs=get_vlc2(gb, codebook.vlc.table, codebook.nb_bits, 3) * 4;
>> >         vec[voffs       ]+=codebook.codevectors[coffs  ];  // FPMATH
>> >         vec[voffs+1     ]+=codebook.codevectors[coffs+2];  // FPMATH
>> >         vec[voffs+vlen  ]+=codebook.codevectors[coffs+1];  // FPMATH
>> >         vec[voffs+vlen+1]+=codebook.codevectors[coffs+3];  // FPMATH
>> >     }
>> > } ...
>> >
>> > 'get_vlc2' call could be replaced with some GET_VLC/GET_RL_VLC variant
>> > so that the number of intermediate excessive UPDATE_CACHE operations is
>> > minimized.
>>
>> These are all nice ideas, but they aren't really related to the change here,
>> so patches are welcome.
>
> applied
Sorry for noticing after it was applied, but isn't this the kind of
code that should have a special #ifdef CONFIG_SMALL case?
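
To make the question concrete, here is a minimal sketch of what such a
CONFIG_SMALL case might look like. It is not the committed code: the names
residue_add and residue_add_internal and the toy loop body are hypothetical
stand-ins, and CONFIG_SMALL is assumed to be a 0/1 macro supplied by the build
(defaulted to 0 here). The idea is that the speed build calls an always-inline
body once per constant 'type', so the compiler can drop the untaken branches
in each copy, while the small build keeps a single generic copy.

```c
#include <assert.h>
#include <stddef.h>

#ifndef CONFIG_SMALL
#define CONFIG_SMALL 0  /* assumption: 0 = favour speed, 1 = favour code size */
#endif

/* Hypothetical stand-in for the real decoder body. 'type' only selects how
 * src is accumulated into vec; it does not model actual Vorbis residue types. */
static inline __attribute__((always_inline))
void residue_add_internal(float *vec, const float *src, size_t n, int type)
{
    for (size_t i = 0; i < n; i++) {
        if (type == 0)
            vec[i] += src[i];
        else if (type == 1)
            vec[i] += 2.0f * src[i];
        else
            vec[i] += src[n - 1 - i];
    }
}

void residue_add(float *vec, const float *src, size_t n, int type)
{
    assert(type >= 0 && type <= 2);
#if CONFIG_SMALL
    /* One generic copy: 'type' is tested inside the loop at run time. */
    residue_add_internal(vec, src, n, type);
#else
    /* Three inlined copies, each with 'type' known at compile time. */
    switch (type) {
    case 0:  residue_add_internal(vec, src, n, 0); break;
    case 1:  residue_add_internal(vec, src, n, 1); break;
    default: residue_add_internal(vec, src, n, 2); break;
    }
#endif
}
```

Both configurations give the same results; only code size and the amount of
per-iteration branching differ, so whether the extra copies pay off would
still need to be benchmarked for the real function.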