[FFmpeg-devel] [PATCH] Split H.264 luma dc idct, implement MMX/SSE2 versions

Fri Jan 14 08:12:11 CET 2011

On 1/14/11 3:29 AM, Jason Garrett-Glaser wrote:
> On Wed, Jan 12, 2011 at 8:06 PM, Jason Garrett-Glaser<jason at x264.com>  wrote:
>> This patch splits the H.264 i16x16 luma-dc idct and implements it in
>> asm on x86.  It does this by storing the DC coefficients in a separate
>> location initially, then scattering them at the end of the asm
>> function.  This lets us use SIMD on the inverse transform and dequant.
>>
>> The result is 1043 ->  413 dezicycles spend in the inverse transform.
>>
>> TODOs:
>>
>> 1.  Don't do the idct_dc/dequant if there are no coefficients.  In the
>> current architecture we don't know this; we'd need to add an entry to
>> scan8 (x264 does this) or move the idct-dc call into cabac/cavlc (I'm
>> fine with this too).  You'd still have to modify them in the latter
>> case to, for example, return the number of coefficients.
>>
>> 2.  THIS PROBABLY BREAKS ARM/PPC/SIMILAR because of an extremely
>> stupid architectural problem in ffh264.  That is, the scantables are
>> transposed in the case of asm, but not in the case of C.  So this
>> means that if my idct_dc function isn't implemented in asm for all
>> architectures that have idct implemented in asm, they'll probably
>> break.  The best solution would be to just throw out the
>> non-transposed scan table: there is zero benefit to having it at all
>> and it just adds complexity and binary size.
>>
>> Jason
>>
>
> Here's an updated version with three patches.  It passes FATE with and
> without asm.
>
> First patch: eliminate the non-transposed scantable mode.  There's no
> reason for this to even exist and other devs on IRC agree.  It just
> adds complexity.  NB: This change allows some simplification, like
> removing some entries in H264Context, but to keep this patch simple I
> didn't do all possible simplifications.
>
> Second patch: Mostly rewritten with new asm and stuff to handle an
> obnoxious corner case.
>
> Third patch: Use x264-style DC NNZ tracking to avoid calling
> dequant_idct when there's no coefficients to deal with.  This could
> also be used in the future to add dc_dequant_idct, which would handle
> the very common case of there being only a DC luma/chroma DC
> coefficient.  In short: in addition to what this does, there's extra
> optimization opportunities added by this patch.

 From what I'm seeing it would be faster already with those patches, did 
you minibenchmark it already?

lu