[FFmpeg-devel] a64 encoder 7th round
Tue Jan 27 22:04:47 CET 2009
> so you claim that copying 256 chars is faster than copying none ?
> if not a system that choose per frame if it used the last or used new
> ones would be better, a fixed 4frame pattern hardly is optimal.
Okay, seems like i have to explain more in detail :-)
The multicolor displayer throws an interrupt every 2 frames and switches
nothing more than a bit in a single register (kind of charmap pointer of
the video chip) and thus advances to a different charmap (0x400 in size).
After 4 charmaps were displayed this way, the interrupt routine unlocks
the loader to load the next charset + 4 screens into the current buffer
(in 0x400 big chunks/packets via network) but switches to the previously
loaded charset and screens beforehand. So it more or less is double
buffering with 4 preloaded frames.
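
The scheme above can be modelled in C as a small state machine. This is only a sketch under assumptions: the real displayer is 6502 code that is not shown in this mail, so `struct player`, `player_init` and `on_frame` are hypothetical names, the interrupt is modelled as a plain function call, and both buffers are assumed to be filled already at start.

```c
#include <assert.h>
#include <stdbool.h>

#define SCREENS_PER_BUFFER 4   /* from the mail: 4 screens per charset */
#define FRAMES_PER_SCREEN  2   /* from the mail: one switch every 2 frames */

/* Hypothetical model of the displayer state; names are illustrative only. */
struct player {
    int  shown_buffer;    /* 0 or 1: buffer the video chip currently reads */
    int  screen;          /* 0..3: which screen of that buffer is shown */
    int  frame;           /* frames counted since the last screen switch */
    int  load_buffer;     /* buffer the loader is allowed to overwrite */
    bool loader_unlocked; /* true once the loader may refill load_buffer */
};

/* Assumption: both buffers already hold valid data when playback starts. */
static void player_init(struct player *p)
{
    p->shown_buffer = 0;
    p->screen = 0;
    p->frame = 0;
    p->load_buffer = 1;
    p->loader_unlocked = false;
}

/* Called once per video frame; stands in for the interrupt routine. */
static void on_frame(struct player *p)
{
    assert(p->shown_buffer == 0 || p->shown_buffer == 1);

    if (++p->frame < FRAMES_PER_SCREEN)
        return;
    p->frame = 0;

    if (++p->screen < SCREENS_PER_BUFFER)
        return;                 /* just advance to the next screen */

    /* All 4 screens were shown: switch to the previously loaded buffer
     * first, then hand the buffer just displayed over to the loader. */
    p->screen = 0;
    p->load_buffer = p->shown_buffer;
    p->shown_buffer ^= 1;
    p->loader_unlocked = true;
}
```

In this model the loader gets a full buffer period (8 frames) to fetch the next charset + 4 screens, and the amount to fetch is the same every time, which is the property the fixed pattern relies on.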
In your suggested scenario i have to take care of the worst case
scenario/cross point, have to know beforehand how many bytes i need to
load, and take care of framesizes. This all sounds trivial, but:
Odd framesizes bloat up the loader, as do varying framesizes, as i need
additional checks. Loading can easily get 50% slower in that case (5
additional cycles to 11 cycles or just even 8 cycles if using generated
speedcode), so the cross over point goes rather low, as handling a
charset delta consumes even more cycles. As the worst case is as slow as
loading plain frames, there is no gain: framerate and quality do
not improve, but it adds a lot of complexity to the displayer (that is
already rather big for that, as it needs to drive the network chip and
handle packets). So all i'd do is save diskspace, but a 500MB mpeg
file shrinks to a ~ 50MB .a64 file at the moment, so not too much of a
The 6502 is just a very scarce platform, offering only 56 different
instructions, not all of them even being orthogonal. I have only three 8 bit
registers, as well as only an 8 bit data bus + a 16 bit address bus.
There is no multiply or divide instruction. So concepts that work out
fine on today's machines often have to be done in a completely
different way on such machines, often by making it just plain and easy,
or by doing some fake that appears to do the same ;-) I invested quite
some time in finding the appropriate display methods, i did the first
conversion prototypes years ago, and discussed a lot with other
c64 scene members to work out the modes i have implemented so far. As
for doing things on a c64, i can look back to the year 1988 when i did
my first tries on that machine. So things on c64 side should already be
rather optimal, but of course the codecs themselves may still have lots
of potential for (speed/quality) improvements. Saving size so far does
not bring any improvement, except when i can reduce framesize in every case.
> didnt i read somewhere that there was some kind of interrupt per row/line
> from which various things could be changed?
Sure, you can, and i do so in ecmh mode, or better to say, i do so every
4th line. That is why i need two times the charmap, and force the video chip
to alternate and reload the charmap each 4th line. This is however time
consuming, as i have to throw 25 interrupts per frame and get the timing
cycle exact by some coding tricks + a hardware timer. Also, when i force
the video chip to reload the current line of the charmap, it takes over
the bus and the cpu has to be idle while the 40 bytes are loaded by the
video chip. That is what is called a bad line in that link i mentioned.
However, this trick does not work with the colorram (which sets the
foreground color of each char), as this is at a fixed address and there is
no register available to change that. (In hardware, this is even an
additional 1k RAM chip besides the normal 64k, so the videochip can access
that area without disturbing the CPU)
So in case of the multicol charset mode, i would only be able to set a
new charmap each 4th line for example, but that would not increase the
amount of colors. It does however, when i am using the extended
background color mode, but then the charsetsize shrinks to 64 chars
only. The result is not satisfying, i gave it a try already, and that is
also how the ecmh mode started to exist, but trying it with a selfmade
> so why do you force it always to multicolor?
> if you just copy the stuff anyway, the encoder could choose per attribute
> cell which is better ...
It is, because i can't change the mode per cell, but rather on a per
line basis or even per frame basis. Also, i don't intend to mix modes
within one video, but rather have a video encoded into a mode of your
> with this limitation a pure multicolor encoder should do the following
> for each frame try all 3 fixed color triplets out of 16 that are
> 560 full frame encodes, isnt going to be terribly fast but it should
> be easy to skip some of these triplets.
> for each block try all of the 8 colors and then from the 4 chosen
> colors select per pixel colors with error diffusion dither choosing
> the best block with sum of abs diff in dct domain.
I have that special table color_mixes, that tells my code (not
multicolor) what colors are a good idea to mix (no matter if by
interlacing or dithering) and what colors are definitely a no go. There
are quite a lot of ugly combinations and some colors really clash
terribly in PAL. Also having one color being changed each block (while
all others stay the same) leads to a blocky result. I mean it, as in, i
know it, as in i tried it, not only in that case, but also with several
converters for plain graphics for the c64. There are quite some tricks
to avoid that, either by doing certain dithering tricks, or by counting
more on the luminance of a color than its chrominance.
By the way i am doing a kind of similar thing in the ecmh mode as you
described above, more or less a bruteforce attempt with some exclusions
to speed up things to a reasonable time. I find out the best
backgroundcolors by adding them incrementally, then find the best 2
backgroundcolors + colorram for each 8x8 block, as well as the best two
chars for that.
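
The "adding them incrementally" step can be pictured as a greedy selection. The following C sketch is not the encoder's actual code, which is not shown in this mail. The error metric (absolute difference of one brightness value per pixel), the flat pixel array and all function names are assumptions made for illustration; it only covers picking the frame-wide colors, not the per-block search for colorram and chars.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define PALETTE_SIZE 16

/* Error of one pixel against the closest color chosen so far.
 * Hypothetical metric: absolute difference of a single brightness value. */
static long pixel_error(int pixel, const int *palette,
                        const int *chosen, int n_chosen)
{
    long best = LONG_MAX;
    for (int i = 0; i < n_chosen; i++) {
        long d = labs((long)pixel - palette[chosen[i]]);
        if (d < best)
            best = d;
    }
    return best;
}

/* Total error of all pixels for a given set of chosen colors. */
static long frame_error(const int *pixels, int n_pixels, const int *palette,
                        const int *chosen, int n_chosen)
{
    long sum = 0;
    for (int i = 0; i < n_pixels; i++)
        sum += pixel_error(pixels[i], palette, chosen, n_chosen);
    return sum;
}

/* Greedy selection: add one color at a time, always the one that lowers
 * the total error the most. Fills chosen[0..n_wanted-1] with indices. */
static void pick_colors_incrementally(const int *pixels, int n_pixels,
                                      const int *palette,
                                      int *chosen, int n_wanted)
{
    assert(n_wanted >= 1 && n_wanted <= PALETTE_SIZE);

    for (int n = 0; n < n_wanted; n++) {
        long best_err = LONG_MAX;
        int best_color = -1;
        for (int c = 0; c < PALETTE_SIZE; c++) {
            int used = 0;
            for (int i = 0; i < n; i++)
                used |= (chosen[i] == c);
            if (used)
                continue;
            chosen[n] = c;          /* tentatively add candidate c */
            long err = frame_error(pixels, n_pixels, palette, chosen, n + 1);
            if (err < best_err) {
                best_err = err;
                best_color = c;
            }
        }
        chosen[n] = best_color;
    }
}
```

Picking n colors this way costs on the order of 16 * n error evaluations instead of trying every combination, which is the kind of exclusion that keeps a brute force search at a reasonable runtime. It can miss the best combination, as earlier picks are never revisited.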
Oh, and as for dithering: Pixels look really big, the 320x200 are
displayed on a 14" monitor with an fbas/s-video input. Error diffusion
is not the choice, except in some very rare cases, like when you display
320x200 with interlaced colors. Just see here for an example: What you
use is ordered dither with certain patterns, and some antialiasing
techniques to improve quality. See here:
Even doing a kind of dithering by using certain forms is common, like
the clouds in this pic show:
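
For readers who have not used ordered dither: the decision per pixel depends only on its position and a fixed threshold pattern, so no error is carried over to neighbouring pixels as with error diffusion. A minimal C sketch follows. The standard 4x4 Bayer matrix is used here as an assumption; the patterns actually used for the c64 modes are not given in this mail, and `use_second_color` is a hypothetical name.

```c
#include <assert.h>

/* Standard 4x4 Bayer threshold matrix (an assumption for this sketch). */
static const unsigned char bayer4[4][4] = {
    {  0,  8,  2, 10 },
    { 12,  4, 14,  6 },
    {  3, 11,  1,  9 },
    { 15,  7, 13,  5 },
};

/* Mix two colors A and B by position only.
 * level is 0..16: how many pixels of each 4x4 tile should get color B.
 * Returns 1 if the pixel at (x, y) gets color B, 0 if it gets color A. */
static int use_second_color(unsigned x, unsigned y, int level)
{
    assert(level >= 0 && level <= 16);
    return bayer4[y & 3][x & 3] < level;
}
```

With level 8 this yields a checkerboard of the two colors, and each 4x4 tile always contains exactly `level` pixels of color B, so the pattern stays regular over the whole picture.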
>> Making things
>> colorful gets really hard then and usually the result looks very blocky.
> did you try above? :)
I know that even with less restrictions it looks already ugly, that is
how you can easily distinguish between hand-drawn/retouched pics and plain
converted pics :-) And having even less color choice won't be very
helpful either. Also, the colors from 0..7 are not the colors you need
most. E.g. brown, orange, pink, gray tones, they all are in the
upper range from 8..15. So it is hard to get skin tones done, or do a
proper gray color fade without them. If you want to hurt your eyes i can
calculate some pics using the lower range only, the limitation is
easily done :-)
So look again at http://www.metalvotze.de/content/videomodes2.php and
look closer at the result of the ecmh mode, you already see some of
those blocky artefacts at the shoulder, that result from a lack of colors
> uint8_t color, index;
> count= read_byte;
> color= read_3bytes;
> *dst++= read_byte;
> *dst++= color;
> *dst++= color;
> *dst++= color;
> this would be too slow?
To read a byte from the network chip packet buffer and store it to the
correct position where it is directly displayable by the videochip, i
need 3 instructions if being lazy. (there is of course some overhead for
loop handling and fetching a new packet).
The above code might look as follows in 6502 (just hacked fast, not tested):
ldx $de00 ;count
lda $de01 ;byte1
lda $de00 ;byte2
lda $de01 ;byte3
;27 cycles used till here
lda $de00 ;data is offered 16 bit wide from network chip
;41 more cycles
beq out ; need to check for odd value of x
;4 more cycles if no branch
lda $de01 ;fetch next byte (network chip offers next byte automatically
          ;when both bytes were read, we are happy to have that feature)
bne more ; need to check for even value of x
;5 more cycles, as we branch hopefully a few times.
= 118 cycles to load 6 bytes (will get a bit less of course if the loop
loops a few times, and i assumed already the buffer being in zeropage,
where we can save one cycle when doing lda/sta).
i can do:
that is 48 cycles for 6 bytes.
But more likely i'll do (as it is easier, and still fast enough):
stx a1+1 ;set highbyte of dest in code
ldx #$00 ;index is lowbyte of dest
a1 sta $0000,x
a2 sta $0000,x
a3 sta $0000,x
a4 sta $0000,x
47 cycles per loop + 22 cycles for setup
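
To put the two figures quoted above side by side (118 cycles vs. 48 cycles for 6 bytes), here is a small C helper. It only does the arithmetic on the numbers from this mail; the function names and the cycle budget in the usage note are assumptions for illustration.

```c
#include <assert.h>

/* Cycles per byte, rounded up, for a routine that moves `bytes` bytes
 * in `cycles` CPU cycles. */
static unsigned cycles_per_byte_ceil(unsigned cycles, unsigned bytes)
{
    assert(bytes > 0);
    return (cycles + bytes - 1) / bytes;
}

/* Number of bytes that fit into a cycle budget when data is moved in
 * chunks of `bytes_per_chunk` bytes costing `cycles_per_chunk` each.
 * Only whole chunks are counted. */
static unsigned long bytes_in_budget(unsigned long budget_cycles,
                                     unsigned cycles_per_chunk,
                                     unsigned bytes_per_chunk)
{
    assert(cycles_per_chunk > 0);
    return budget_cycles / cycles_per_chunk * bytes_per_chunk;
}
```

With the figures above, `cycles_per_byte_ceil(118, 6)` gives 20 and `cycles_per_byte_ceil(48, 6)` gives 8. For a made-up budget of 10000 cycles, `bytes_in_budget(10000, 118, 6)` is 504 bytes against 1248 bytes for `bytes_in_budget(10000, 48, 6)`, so the plain copy moves roughly 2.5 times as much data in the same time.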
So overall, i tried many things, even tried RLE and such, it did not
bring any improvement, not with the speed i can achieve with loading that
Convinced now? :-)