[FFmpeg-devel] [PATCH] Higher bit-depth x86 SIMD assembly for yadif

Thu Jan 19 22:44:51 CET 2012

Hi

CC-ing Dark Shikari & Loren, as they might want to review too.

On Thu, Jan 19, 2012 at 08:55:58PM +0100, James Darnley wrote:
> Attached are five patches which add code for:
> mmx to sse4 instruction sets for 15 and 16 bits per sample
> mmx to ssse3 instruction sets for 9 to 14 bits per sample
> actual support of 9 bits per sample
> 
> I know that 11 to 15 bits per sample don't exist at present but
> support might be added since h264 allows up to 14 bits per sample.
> Anyway, all the code added here is used for existing features.
> 
> Below, I have copied the commit messages for convenience.
> 
> Something else to think about.  The source code clarity could be
> greatly improved by using yasm and its preprocessor.  I wonder how
> much abstraction it would need to roll the source of all three
> functions together and whether it would save source code size.

If you want to convert it to yasm, that's fine; if not, that's fine too.
Whichever way you prefer.

> 
> Subject: [PATCH 1/5] x86 SIMD for 16 bits per sample in yadif
> 
> It might be a rather dumb copy of the 8-bit SIMD but it works and
> produces identical output to the C.  The MMX and SSE2 versions have been
> tested on my Athlon64.  The SSSE3 and SSE4.1 versions need testing and
> benching elsewhere.
> 
> Benchmarks on the Athlon64 using a 704px wide video, per line:
> 1693075 decicycles in C, 521977 runs, 2311 skips
> 1029468 decicycles in mmx, 523347 runs, 941 skips
>  730504 decicycles in sse2, 523474 runs, 814 skips
> 
> Subject: [PATCH 2/5] x86 SIMD for 9 to 14 bits per sample in yadif
> 
> These lower bit depths do not need unpacking to double words letting the
> code process more pixels per iteration (still 2 in mmx but 6 in sse2)
> and avoiding emulating the missing double word instructions on older
> instruction sets.
> 
> Benchmarks on my Athlon64 using a 704 pixel wide video, per line:
> 1695927 decicycles in C, 260986 runs, 1158 skips
>  854770 decicycles in mmx, 261717 runs, 427 skips
>  440202 decicycles in sse2, 261829 runs, 315 skips
> 
> Works out at:
> mmx - 1.20 times faster than the 16 bit
> sse2 - 1.66 times faster than the 16 bit
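
For reference, the two ratios follow directly from the per-line numbers
quoted above (16-bit time divided by the 9 to 14-bit time for the same
instruction set). A throwaway sketch; speedup() is a made-up helper, not
something from the patch:

```c
/* How many times faster the `fast` timing is than the `slow` one.
 * Both arguments are timings in the same unit (decicycles here). */
static double speedup(double slow, double fast)
{
    return slow / fast;
}
```

speedup(1029468, 854770) gives about 1.20 for mmx and
speedup(730504, 440202) gives about 1.66 for sse2, which matches the
figures in the commit message.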

[...]
> +            "paddd     "MM"6, "MM"3 \n\t" /* d+diff */\
> +            PMAXSD(MM"2",MM"1",MM"7")\
> +            PMINSD(MM"3",MM"1",MM"7")\
> +            PACK(MM"1")\
> +\
> +            :\
> +            :[tmpA] "r"(tmpA),\
> +             [prev] "r"(prev),\
> +             [cur]  "r"(cur),\
> +             [next] "r"(next),\
> +             [prefs]"r"(prefs),\
> +             [mrefs]"r"(mrefs),\
> +             [mode] "g"(mode)\

This should list the SIMD registers written to in the clobber list;
otherwise, with SSE*, there may be issues on win64 and, in theory,
elsewhere too.
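
To illustrate, here is a minimal sketch (not code from the patch;
add8_u16 is a made-up helper, and the plain-C branch is only there so it
builds on non-x86-64 targets). Every xmm register the block writes is
named after the third colon, so the compiler knows it cannot keep
anything live in them. This matters most on win64, where xmm6 to xmm15
are callee-saved:

```c
#include <stdint.h>

/* Hypothetical helper: unsigned saturating add of 8 uint16_t values. */
static void add8_u16(uint16_t *dst, const uint16_t *a, const uint16_t *b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile(
        "movdqu   (%[a]), %%xmm0   \n\t" /* load 8 words from a */
        "movdqu   (%[b]), %%xmm1   \n\t" /* load 8 words from b */
        "paddusw  %%xmm1, %%xmm0   \n\t" /* a + b, saturated to 65535 */
        "movdqu   %%xmm0, (%[dst]) \n\t" /* store 8 words to dst */
        : /* no register outputs, dst is written through memory */
        : [dst] "r"(dst), [a] "r"(a), [b] "r"(b)
        : "xmm0", "xmm1", "memory" /* registers and memory we touch */
    );
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint16_t)(s > 0xFFFF ? 0xFFFF : s);
    }
#endif
}
```

In the patch the register names come from the MM macro, so the clobber
list would presumably need a matching macro; that part is left out here.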

> +        );\
> +        __asm__ volatile(MOVH" "MM"1, %0" :"=m"(*dst));\

I guess it should be OK in practice, but it's not guaranteed that
SIMD registers don't change between asm blocks.
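
A minimal sketch of the safer shape, assuming it is acceptable to fold
the final store into the main block (max8_s16 is a made-up helper, not
from the patch, and the plain-C branch is only a fallback for
non-x86-64 builds). The result is written through an "=m" output
operand of the same asm statement, so nothing has to survive in a
register between two statements:

```c
#include <stdint.h>

/* Hypothetical helper: per-element signed max of 8 int16_t values.
 * Load, compute and store all happen inside ONE asm statement. */
static void max8_s16(int16_t *dst, const int16_t *a, const int16_t *b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__ volatile(
        "movdqu  %[a], %%xmm0   \n\t" /* load 8 words from a */
        "movdqu  %[b], %%xmm1   \n\t" /* load 8 words from b */
        "pmaxsw  %%xmm1, %%xmm0 \n\t" /* signed max per word */
        "movdqu  %%xmm0, %[dst] \n\t" /* store in the same block */
        : [dst] "=m"(*(int16_t (*)[8])dst)
        : [a] "m"(*(const int16_t (*)[8])a),
          [b] "m"(*(const int16_t (*)[8])b)
        : "xmm0", "xmm1"
    );
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < 8; i++)
        dst[i] = a[i] > b[i] ? a[i] : b[i];
#endif
}
```

With two separate statements the compiler is free to use the register
for something else in between, even if it rarely does so in practice.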

[...]

Also, feel free to add yourself as yadif SIMD maintainer in the
MAINTAINERS file if you like.

And very nice work and speed-up.

Thanks!

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Asymptotically faster algorithms should always be preferred if you have
asymptotical amounts of data