[FFmpeg-devel] [PATCH] Higher bit-depth x86 SIMD assembly for yadif
michaelni at gmx.at
Thu Jan 19 22:44:51 CET 2012
CC-ing Dark Shikari & Loren, as they might want to review too.
On Thu, Jan 19, 2012 at 08:55:58PM +0100, James Darnley wrote:
> Attached are five patches which add code for:
> mmx to sse4 instruction sets for 15 and 16 bits per sample
> mmx to ssse3 instruction sets for 9 to 14 bits per sample
> actual support of 9 bits per sample
> I know that 11 to 15 bits per sample don't exist at present but
> support might be added since h264 allows up to 14 bits per sample.
> Anyway, all the code added here is used for existing features.
> Below, I have copied the commit messages for convenience.
> Something else to think about. The source code clarity could be
> greatly improved by using yasm and its preprocessor. I wonder how
> much abstraction it would need to roll the source to all three
> functions together and whether it would save source code size.
If you want to convert it to yasm, that's fine; if not, that's fine too.
Whichever way you prefer.
> Subject: [PATCH 1/5] x86 SIMD for 16 bits per sample in yadif
> It might be a rather dumb copy of the 8-bit SIMD but it works and
> produces identical output to the C. The MMX and SSE2 has been tested on
> my Athlon64. The SSSE3 and SSE4.1 needs testing and benching elsewhere.
> Benchmarks on the Athlon64 using a 704px wide video, per line:
> 1693075 decicycles in C, 521977 runs, 2311 skips
> 1029468 decicycles in mmx, 523347 runs, 941 skips
> 730504 decicycles in sse2, 523474 runs, 814 skips
> Subject: [PATCH 2/5] x86 SIMD for 9 to 14 bits per sample in yadif
> These lower bit depths do not need unpacking to double words letting the
> code process more pixels per iteration (still 2 in mmx but 6 in sse2)
> and avoiding emulating the missing double word instructions on older
> instruction sets.
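A rough sketch of why 9 to 14 bits can stay in words. It assumes the limiting factor is that the sum of two samples has to fit in a signed 16-bit lane; the commit message does not state the reason, so treat that as my guess. The function names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Largest possible sum of two samples at the given bit depth. */
static int32_t max_pair_sum(int bits)
{
    assert(bits >= 1 && bits <= 16);
    int32_t max_sample = ((int32_t)1 << bits) - 1;
    return 2 * max_sample;
}

/* Returns 1 if that sum still fits in a signed 16-bit lane, 0 otherwise. */
static int fits_signed_word(int bits)
{
    return max_pair_sum(bits) <= INT16_MAX;
}
```

Under that assumption 14 bits is the last depth that fits (2 * 16383 = 32766), while 15 bits already overflows (2 * 32767 = 65534), which would match the 9 to 14 versus 15 and 16 split.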
> Benchmarks on my Athlon64 using a 704 pixel wide video, per line:
> 1695927 decicycles in C, 260986 runs, 1158 skips
> 854770 decicycles in mmx, 261717 runs, 427 skips
> 440202 decicycles in sse2, 261829 runs, 315 skips
> Works out at:
> mmx - 1.20 times faster than the 16 bit
> sse2 - 1.66 times faster than the 16 bit
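The quoted ratios can be rechecked from the decicycle counts above. A minimal sketch, with a hypothetical helper name:

```c
#include <assert.h>

/* Speed-up of a new timing over an old one; both in the same unit. */
static double speedup(double old_decicycles, double new_decicycles)
{
    assert(new_decicycles > 0.0);
    return old_decicycles / new_decicycles;
}
```

speedup(1029468, 854770) comes to about 1.20 and speedup(730504, 440202) to about 1.66, which agrees with the figures given. Against the C version in the same run, speedup(1695927, 854770) is about 1.98 and speedup(1695927, 440202) about 3.85.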
> + "paddd "MM"6, "MM"3 \n\t" /* d+diff */\
> + PMAXSD(MM"2",MM"1",MM"7")\
> + PMINSD(MM"3",MM"1",MM"7")\
> + PACK(MM"1")\
> + :\
> + :[tmpA] "r"(tmpA),\
> + [prev] "r"(prev),\
> + [cur] "r"(cur),\
> + [next] "r"(next),\
> + [prefs]"r"(prefs),\
> + [mrefs]"r"(mrefs),\
> + [mode] "g"(mode)\
This should list the SIMD registers written to in the clobber list;
otherwise with SSE* there may be issues on win64 and in theory also
> + );\
> + __asm__ volatile(MOVH" "MM"1, %0" :"=m"(*dst));\
I guess it should be OK in practice, but it's not guaranteed that
SIMD registers don't change between asm blocks.
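To illustrate both points, here is a minimal sketch, not the yadif code: a hypothetical helper that does its store inside the same asm statement as the arithmetic and names the xmm registers it writes in the clobber list. It assumes GCC-style inline asm on x86 with SSE2 and falls back to plain C elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: dst[i] = a[i] + b[i] for i < 4. */
static void add4_epi32(int32_t *dst, const int32_t *a, const int32_t *b)
{
    assert(dst && a && b);
#if defined(__x86_64__) || (defined(__i386__) && defined(__SSE2__))
    __asm__ volatile(
        "movdqu (%1), %%xmm0   \n\t" /* load four values from a */
        "movdqu (%2), %%xmm1   \n\t" /* load four values from b */
        "paddd  %%xmm1, %%xmm0 \n\t" /* xmm0 += xmm1 */
        "movdqu %%xmm0, (%0)   \n\t" /* store in the same block */
        :
        : "r"(dst), "r"(a), "r"(b)
        : "xmm0", "xmm1", "memory"   /* registers written, plus memory */
    );
#else
    /* Portable fallback for targets without SSE2. */
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

Since the result never has to survive in a register across two asm statements, the compiler is free to use xmm0 and xmm1 for its own purposes before and after the block.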
Also, feel free to add yourself as yadif SIMD maintainer to the
MAINTAINERS file if you like.
And very nice work and speed-up.
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data