[FFmpeg-devel] a64 encoder 7th round

Michael Niedermayer michaelni
Sat Jan 31 23:16:29 CET 2009


On Sat, Jan 31, 2009 at 08:26:20PM +0100, Bitbreaker/METALVOTZE wrote:
> Michael Niedermayer schrieb:
> > On Sat, Jan 31, 2009 at 01:59:48PM +0100, Bitbreaker/METALVOTZE wrote:
> >   
> >>>>> now a few questions, i hope i am not too annoying
> >>>>> the low nibble is either 15 or 8 if i did RTFS correctly
> >>>>> do you have 64 bytes left for a LUT?
> >>>>> if so you can do some code equivalent to
> >>>>>
> >>>>> x= read_net();
> >>>>> dst[0]=lut[x  ];
> >>>>> dst[1]=lut[x+1];
> >>>>> dst[2]=lut[x+2];
> >>>>> dst[3]=lut[x+3];
> >>>>> x= read_net();
> >>>>> dst[4]=lut[x  ];
> >>>>> dst[5]=lut[x+1];
> >>>>> dst[6]=lut[x+2];
> >>>>> dst[7]=lut[x+3];
> >>>>> ...
> >>>>>
> >>>> Gotta see tomorrow if that works...
> >> Hmm, i am afraid a lookup from the table is as expensive as reading a
> >> byte from the network :-) So overall you end up at the same speed again.
> >>     
> >
> > but it's a lower bitrate, so if it's not worse in any other way you at least
> > have smaller files, and smaller by a not insignificant amount
> > was there a disadvantage in the 5col mode over 4col except filesize?
> >   
> Files are rather small already compared to a normal video. But for the
> sake of size i might pack 2 nibbles together (2, just to still have the
> chance to use the full range of colors at some point, you never know, so
> we better keep that option). That would save 0x200 bytes and add a LUT and
> extra code to the displayer. Also, i'd have the last packet ending at an
> 0x100 boundary, which would avoid even more extra code on the c64 side. But i
> might implement that and therefore also interleave the charset so i get a
> constant packet size.
> The disadvantage of 5col mode is the size of a frame itself, as i can't load
> it within 2 vsyncs. 4col mode works well between 2 vsyncs. Ecmh mode
> even needs 4 vsyncs as it loads 0xc00 bytes per frame and the forcing of
> additional badlines consumes even more time.
> As for 5col mode, i am not sure anyway whether it is the nicest thing to
> either lift the darkest area or lower the brightest area if both occur in
> a single block. But that is nothing that helps regarding the frame size
> and loading times :-)

you could always include white and black and switch a middle one, assuming
there's no odd limitation that makes that impossible


> 
> >>    [setup as always]
> >>    ...
> >>   
> >>    ldx $de00
> >>    lda lut+0,x
> >>    sta dest,y
> >>    iny
> >>    lda lut+1,x
> >>    sta dest,y
> >>    iny
> >>    lda lut+2,x
> >>    sta dest,y
> >>    iny
> >>    lda lut+3,x
> >>    sta dest,y
> >>    iny
> >>
> >> that is 12 cycles per reconstructed byte in the inner loop.
> >>     
> >
> > i see 4+5+2=11 per byte output + some overhead per each 4 byte group
> >   
> yes, i counted the ldx in, as it can't be avoided per block.
> > but this can be improved, you don't need to write 4 consecutive bytes
> > you can write (0,64,128,192), (1,65,129,193), ...
> > code should be:
> >
> > ldx $de00
> > lda lut+0,x
> > sta dest,y
> > lda lut+1,x
> > sta dest+64,y
> > lda lut+2,x
> > sta dest+128,y
> > lda lut+3,x
> > sta dest+192,y
> > iny
> >
> > this saves 3 iny per 4 bytes written, thus 2*3/4=1.5 cycles per byte faster
> > the same trick might be usable for the generic copy from network as well
> > maybe?
> >   
> sure, i can also completely unroll the loop and then save even more, but 
> as long as i don't save my 18700 cycles, there is no need to, as no 
> improvement happens. So no need to make things more complex :-) See, for 
> storing the bytes i'll always need my 5 cycles (or 4 on a complete 
> unroll without index), there is nothing that can be done to avoid that. 
> Depending on how much i unroll things there are 4-6 cycles needed for 
> getting the byte or getting some bytes and doing some kind of decoding. If
> i saved 6 cycles per byte in the best case and need to load 10*256 bytes
> (one average frame in 5col) i would overall save 15360 cycles, that is,
> if bytes were loaded at no cost automagically :-) That is still not
> my desired goal of 18700 cycles to save :-) So how to solve that? :-)

you don't load them
i did mention this, didn't i? Reuse data from the previous frame or the 4th
previous, depending on mem layout, like P frames.
you have 1000 bytes of this colorram thing storing the 5th color, why do you
change it every frame?
change it every 2 or 4, and ideally let the encoder decide when to update
within some limit instead of hardcoding an update every 4 frames.
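
A minimal sketch in C of what "let the encoder decide" could look like,
assuming the encoder keeps a copy of the colorram contents it last sent.
The names colorram_diff() and should_update_colorram(), the threshold and
the max_interval limit are made up for illustration and are not taken from
the a64 encoder patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLORRAM_SIZE (25 * 40)   /* 1000 cells, one per character */

/* Hypothetical helper: number of cells whose colour would change.
 * Only the low nibble is compared, colour RAM holds 4 bits per cell. */
static size_t colorram_diff(const uint8_t *sent, const uint8_t *wanted)
{
    size_t changed = 0;
    for (size_t i = 0; i < COLORRAM_SIZE; i++)
        changed += ((sent[i] ^ wanted[i]) & 0x0F) != 0;
    return changed;
}

/* Hypothetical policy: resend colour RAM when enough cells changed, or
 * when max_interval frames have passed since the last update.
 * Returns 1 to send, 0 to reuse what the C64 already has. */
static int should_update_colorram(const uint8_t *sent, const uint8_t *wanted,
                                  unsigned frames_since_update,
                                  unsigned max_interval, size_t threshold)
{
    assert(max_interval > 0);
    if (frames_since_update >= max_interval)
        return 1;
    return colorram_diff(sent, wanted) >= threshold;
}
```

With max_interval=4 this degrades to the fixed every-4-frames update in the
worst case, and skips the 1000 bytes whenever little has changed.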

also what's this 18700 cycles thing?
4col mode can be done in 2 vsyncs
5col needs 3 vsyncs
the difference is 25*40=1000 bytes per frame, for which our 8 entry LUT
would need ~10 cycles per byte, that's 10k cycles not 18.7k
and as you said your normal copy is not optimal either so there must
be more headroom if 4col works at full speed currently.
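
The arithmetic above, spelled out as a small C sketch. The 25*40 bytes and
the 18700 cycle figure are the ones quoted in this thread, the per
instruction costs are the standard 6502 timings without page crossing, and
loop overhead (branch, pointer setup) is left out on purpose. The function
and macro names are made up for illustration:

```c
#include <assert.h>

#define COLORRAM_BYTES (25 * 40)  /* extra bytes per 5col frame */

/* 6502 cycle counts, no page crossing assumed. */
enum { LDX_ABS = 4, LDA_ABS_X = 4, STA_ABS_Y = 5, INY = 2 };

/* One ldx, then 4 x (lda, sta, iny): consecutive destination bytes. */
static unsigned group_cycles_consecutive(void)
{
    return LDX_ABS + 4 * (LDA_ABS_X + STA_ABS_Y + INY);
}

/* One ldx, then 4 x (lda, sta) to dest+0/64/128/192 and a single iny. */
static unsigned group_cycles_strided(void)
{
    return LDX_ABS + 4 * (LDA_ABS_X + STA_ABS_Y) + INY;
}

/* Cycles for nbytes of output, produced in groups of 4 bytes. */
static unsigned long frame_cycles(unsigned long nbytes, unsigned group_cycles)
{
    assert(nbytes % 4 == 0);
    return nbytes / 4 * group_cycles;
}
```

Under these assumptions the strided variant costs 42 cycles per 4 bytes,
so frame_cycles(COLORRAM_BYTES, group_cycles_strided()) gives 10500, well
below 18700.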

so what is really missing to make 5col as fast as 4col?

also with codecs like mpeg4 a 0.1% loss from what is achievable means
rejection, you argue that a 30% reduction in filesize is negligible ...
And that is a 30% reduction that at the same time is faster than your
current code, even if it's not fast enough to make the next vsync it
means more free time that could be used for other things.

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The greatest way to live with honor in this world is to be what we pretend
to be. -- Socrates