[FFmpeg-devel] [PATCH] vf_overlay: add support to RGBA packed input and output
Michael Niedermayer
michaelni at gmx.at
Sat Oct 29 18:09:40 CEST 2011
On Sat, Oct 29, 2011 at 05:26:52PM +0200, Michael Niedermayer wrote:
> On Sat, Oct 29, 2011 at 04:47:41PM +0200, Stefano Sabatini wrote:
> > On date Saturday 2011-10-29 04:10:04 +0200, Michael Niedermayer encoded:
> > > On Sat, Oct 29, 2011 at 12:56:15AM +0200, Stefano Sabatini wrote:
> > [...]
> > > > > please benchmark this with START/STOP_TIMER against the previous code
> > > >
> > > > RGB path was disabled before this one, I split the present patch and
> > > > did some tests.
> > > >
> > > > * Test with no alpha in the main input
> > > >
> > > > before alpha premultiplication
> > > > 1287135 dezicycles in first, 2 runs, 0 skips
> > > > 1335442 dezicycles in first, 4 runs, 0 skips
> > > > 1245555 dezicycles in first, 8 runs, 0 skips
> > > > 1162359 dezicycles in first, 16 runs, 0 skips
> > > > 1144390 dezicycles in first, 32 runs, 0 skips
> > > > 1134602 dezicycles in first, 64 runs, 0 skips
> > > > 1133281 dezicycles in first, 128 runs, 0 skips
> > > > 1114852 dezicycles in first, 256 runs, 0 skips
> > > > 1108999 dezicycles in first, 512 runs, 0 skips
> > > > 1101536 dezicycles in first, 1024 runs, 0 skips
> > > > 1096821 dezicycles in first, 2048 runs, 0 skips
> > > > 1090508 dezicycles in first, 4096 runs, 0 skips
> > > > 1085896 dezicycles in first, 8192 runs, 0 skips
> > > > 1084802 dezicycles in first, 16384 runs, 0 skips
> > > > 1083604 dezicycles in first, 32768 runs, 0 skips
> > > >
> > > > after alpha premultiplication
> > > > 1224390 dezicycles in second, 2 runs, 0 skips
> > > > 1202235 dezicycles in second, 4 runs, 0 skips
> > > > 1191453 dezicycles in second, 8 runs, 0 skips
> > > > 1183031 dezicycles in second, 16 runs, 0 skips
> > > > 1230087 dezicycles in second, 32 runs, 0 skips
> > > > 1227492 dezicycles in second, 64 runs, 0 skips
> > > > 1230488 dezicycles in second, 128 runs, 0 skips
> > > > 1215128 dezicycles in second, 256 runs, 0 skips
> > > > 1207364 dezicycles in second, 512 runs, 0 skips
> > > > 1199813 dezicycles in second, 1024 runs, 0 skips
> > > > 1195857 dezicycles in second, 2048 runs, 0 skips
> > > > 1193954 dezicycles in second, 4096 runs, 0 skips
> > > > 1194128 dezicycles in second, 8192 runs, 0 skips
> > > > 1187481 dezicycles in second, 16384 runs, 0 skips
> > > > 1181874 dezicycles in second, 32768 runs, 0 skips
> > > >
> > > > * Test with alpha in the main input:
> > > > 28684935 dezicycles in first, 2 runs, 0 skips
> > > > 28553902 dezicycles in first, 4 runs, 0 skips
> > > > 28776015 dezicycles in first, 8 runs, 0 skips
> > > > 29073680 dezicycles in first, 16 runs, 0 skips
> > > > 28816918 dezicycles in first, 32 runs, 0 skips
> > > > 28908704 dezicycles in first, 64 runs, 0 skips
> > > > 28745401 dezicycles in first, 128 runs, 0 skips
> > > > 28614980 dezicycles in first, 256 runs, 0 skips
> > > > 28609710 dezicycles in first, 512 runs, 0 skips
> > > > 28537037 dezicycles in first, 1024 runs, 0 skips
> > > > 28517850 dezicycles in first, 2048 runs, 0 skips
> > > > 28466515 dezicycles in first, 4096 runs, 0 skips
> > > > 28438388 dezicycles in first, 8192 runs, 0 skips
> > > > 28440383 dezicycles in first, 16384 runs, 0 skips
> > > > 28426314 dezicycles in first, 32768 runs, 0 skips
> > > >
> > > > 33347880 dezicycles in second, 2 runs, 0 skips
> > > > 33131272 dezicycles in second, 4 runs, 0 skips
> > > > 38018970 dezicycles in second, 8 runs, 0 skips
> > > > 48715928 dezicycles in second, 16 runs, 0 skips
> > > > 44290285 dezicycles in second, 32 runs, 0 skips
> > > > 43696766 dezicycles in second, 64 runs, 0 skips
> > > > 38599173 dezicycles in second, 128 runs, 0 skips
> > > > 36112571 dezicycles in second, 256 runs, 0 skips
> > > > 34737837 dezicycles in second, 512 runs, 0 skips
> > > > 34066213 dezicycles in second, 1024 runs, 0 skips
> > > > 33640178 dezicycles in second, 2048 runs, 0 skips
> > > > 33368757 dezicycles in second, 4096 runs, 0 skips
> > > > 33233522 dezicycles in second, 8192 runs, 0 skips
> > > > 33132908 dezicycles in second, 16384 runs, 0 skips
> > > > 33062949 dezicycles in second, 32768 runs, 0 skips
> > > >
> > > > Results are as expected: alpha pre-multiplication is significantly
> > > > slower, but it may also be what the user wants, so I could make it
> > > > optional (preserving the original alpha? enabled by default?).
> > >
> > > that's not what I meant
> > >
> > > the original code looked like this:
> > > > - d[r] = (d[r] * (0xff - s[3]) + s[0] * s[3] + 128) >> 8;
> > > > - d[1] = (d[1] * (0xff - s[3]) + s[1] * s[3] + 128) >> 8;
> > > > - d[b] = (d[b] * (0xff - s[3]) + s[2] * s[3] + 128) >> 8;
> > >
> > > when I saw what you replaced it with I was ... scared ;)
> > >
> > > if and switch are added in the innermost loop
> > > constants are replaced by variables
> > > variables are replaced by reading out of arrays from structures
> > > a division is added
> > >
> > > all this makes the code significantly slower
> > >
> > > Can you explain what equation you are trying to implement?
> >
> >
> > Changed the code (second patch), now testbench results changed from:
> > 29891505 dezicycles in first, 2 runs, 0 skips
> > 29780850 dezicycles in first, 4 runs, 0 skips
> > 30056100 dezicycles in first, 8 runs, 0 skips
> > 30378746 dezicycles in first, 16 runs, 0 skips
> > 31263998 dezicycles in first, 32 runs, 0 skips
> > 31422349 dezicycles in first, 64 runs, 0 skips
> > 31441573 dezicycles in first, 128 runs, 0 skips
> > 31319009 dezicycles in first, 256 runs, 0 skips
> > 30925767 dezicycles in first, 512 runs, 0 skips
> > 33965521 dezicycles in first, 1024 runs, 0 skips
> > 32342480 dezicycles in first, 2048 runs, 0 skips
> > 31631954 dezicycles in first, 4096 runs, 0 skips
> > 31252298 dezicycles in first, 8192 runs, 0 skips
> > 31572626 dezicycles in first, 16383 runs, 1 skips
> > 31102288 dezicycles in first, 32767 runs, 1 skips
> >
> > to:
> > 26084640 dezicycles in first, 2 runs, 0 skips
> > 23856690 dezicycles in first, 4 runs, 0 skips
> > 24238267 dezicycles in first, 8 runs, 0 skips
> > 26151311 dezicycles in first, 16 runs, 0 skips
> > 25807400 dezicycles in first, 32 runs, 0 skips
> > 27391090 dezicycles in first, 64 runs, 0 skips
> > 26028030 dezicycles in first, 128 runs, 0 skips
> > 23729756 dezicycles in first, 256 runs, 0 skips
> > 22114165 dezicycles in first, 512 runs, 0 skips
> > 21465190 dezicycles in first, 1024 runs, 0 skips
> > 20951560 dezicycles in first, 2048 runs, 0 skips
> > 20736770 dezicycles in first, 4096 runs, 0 skips
> > 20573711 dezicycles in first, 8192 runs, 0 skips
> > 20570483 dezicycles in first, 16384 runs, 0 skips
> > 20634111 dezicycles in first, 32768 runs, 0 skips
> >
> > With the second patch applied (no alpha in the main input):
> > 24551340 dezicycles in second, 2 runs, 0 skips
> > 23764147 dezicycles in second, 4 runs, 0 skips
> > 23118037 dezicycles in second, 8 runs, 0 skips
> > 22992204 dezicycles in second, 16 runs, 0 skips
> > 22960603 dezicycles in second, 32 runs, 0 skips
> > 23015486 dezicycles in second, 64 runs, 0 skips
> > 23007612 dezicycles in second, 128 runs, 0 skips
> > 22955180 dezicycles in second, 256 runs, 0 skips
> > 23277693 dezicycles in second, 512 runs, 0 skips
> > 23147960 dezicycles in second, 1024 runs, 0 skips
> > 22940401 dezicycles in second, 2048 runs, 0 skips
> > 22811952 dezicycles in second, 4096 runs, 0 skips
> > 22760982 dezicycles in second, 8192 runs, 0 skips
> > 22676573 dezicycles in second, 16384 runs, 0 skips
> > 22622130 dezicycles in second, 32768 runs, 0 skips
> > (the slowdown is due to the added ifs).
> >
> > With alpha in the main input/output:
> > 41009130 dezicycles in second, 2 runs, 0 skips
> > 36964740 dezicycles in second, 4 runs, 0 skips
> > 34723803 dezicycles in second, 8 runs, 0 skips
> > 39728604 dezicycles in second, 16 runs, 0 skips
> > 40790327 dezicycles in second, 32 runs, 0 skips
> > 38958495 dezicycles in second, 64 runs, 0 skips
> > 36674410 dezicycles in second, 128 runs, 0 skips
> > 35057610 dezicycles in second, 256 runs, 0 skips
> > 33985402 dezicycles in second, 512 runs, 0 skips
> > 33323452 dezicycles in second, 1024 runs, 0 skips
> > 32870493 dezicycles in second, 2048 runs, 0 skips
> > 32565989 dezicycles in second, 4096 runs, 0 skips
> > 32464448 dezicycles in second, 8192 runs, 0 skips
> > 32574558 dezicycles in second, 16384 runs, 0 skips
> > 32468892 dezicycles in second, 32768 runs, 0 skips
> >
> > Regarding the second patch, I kept Mark's code, but after some time
> > spent tinkering with it I couldn't figure out the meaning of the
> > equation:
>
> > d[da] = ( (d[da] << 8) + (256 - d[da]) * s[sa] ) >> 8;
>
> more correct:
> d += ((255 - d) * s + 128) / 255;
>
> and /255 can be done by multiplication and shift
d += (((255 - d) * s + 129) * 257) >> 16;
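
As a rough sanity check of the two forms above, here is a minimal
standalone C sketch that compares them over every 8-bit (d, s) pair.
The function names are made up for illustration and are not part of
vf_overlay; the only assumption is that d and s are 8-bit alpha values.

```c
#include <assert.h>
#include <stdint.h>

/* Form with the explicit division, as written in the message. */
static uint8_t alpha_over_div(uint8_t d, uint8_t s)
{
    return (uint8_t)(d + ((255 - d) * s + 128) / 255);
}

/* Division-free form: multiply by 257 and shift right by 16. */
static uint8_t alpha_over_mul(uint8_t d, uint8_t s)
{
    return (uint8_t)(d + ((((255 - d) * s + 129) * 257) >> 16));
}

/* Hypothetical helper: assert that both forms agree for all 8-bit inputs. */
static void check_div255_identity(void)
{
    for (int d = 0; d < 256; d++)
        for (int s = 0; s < 256; s++)
            assert(alpha_over_div((uint8_t)d, (uint8_t)s) ==
                   alpha_over_mul((uint8_t)d, (uint8_t)s));
}
```

The two agree because 255 * 257 = 65535: writing y = (255 - d) * s + 128
as 255 * q + r gives (y + 1) * 257 = 65536 * q + (257 * (r + 1) - q),
and the bracketed term stays within 0..65535 for the values that occur
here, so the shift by 16 yields exactly q = y / 255.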
[...]
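
For reference, the original three-line blend quoted earlier in the
thread can be pictured as the row loop below. This is only a sketch
under assumptions, not the actual vf_overlay code: the function name,
the parameter list and the packed 4-byte source layout (R, G, B, A)
are hypothetical. It illustrates the point about the innermost loop:
the channel offsets are resolved once by the caller, and the loop body
has no branches, no table lookups from structures, and no division.

```c
#include <stdint.h>

/* Hypothetical helper: blend one row of packed RGBA source pixels onto a
 * packed destination row. r and b are the destination byte offsets of the
 * red and blue channels; green is assumed to sit at offset 1, as in the
 * quoted code. dst_step is the destination pixel size in bytes. */
static void blend_row(uint8_t *dst, int dst_step, int r, int b,
                      const uint8_t *src, int width)
{
    for (int i = 0; i < width; i++) {
        const int a  = src[3];      /* source alpha */
        const int ia = 0xff - a;    /* inverse alpha */

        /* >> 8 approximates the division by 255, so results can be
         * off by one compared with an exact /255. */
        dst[r] = (uint8_t)((dst[r] * ia + src[0] * a + 128) >> 8);
        dst[1] = (uint8_t)((dst[1] * ia + src[1] * a + 128) >> 8);
        dst[b] = (uint8_t)((dst[b] * ia + src[2] * a + 128) >> 8);

        dst += dst_step;
        src += 4;
    }
}
```

Any per-format decision (RGB vs BGR order, presence of a destination
alpha byte) would be made outside this function, for example by
selecting r, b and dst_step once per frame.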
