[FFmpeg-devel] [PATCH] Split H.264 luma dc idct, implement MMX/SSE2 versions

Jason Garrett-Glaser jason
Fri Jan 14 03:29:35 CET 2011


On Wed, Jan 12, 2011 at 8:06 PM, Jason Garrett-Glaser <jason at x264.com> wrote:
> This patch splits the H.264 i16x16 luma-dc idct and implements it in
> asm on x86. ?It does this by storing the DC coefficients in a separate
> location initially, then scattering them at the end of the asm
> function. ?This lets us use SIMD on the inverse transform and dequant.
>
> The result is 1043 -> 413 dezicycles spend in the inverse transform.
>
> TODOs:
>
> 1. ?Don't do the idct_dc/dequant if there are no coefficients. ?In the
> current architecture we don't know this; we'd need to add an entry to
> scan8 (x264 does this) or move the idct-dc call into cabac/cavlc (I'm
> fine with this too). ?You'd still have to modify them in the latter
> case to, for example, return the number of coefficients.
>
> 2. ?THIS PROBABLY BREAKS ARM/PPC/SIMILAR because of an extremely
> stupid architectural problem in ffh264. ?That is, the scantables are
> transposed in the case of asm, but not in the case of C. ?So this
> means that if my idct_dc function isn't implemented in asm for all
> architectures that have idct implemented in asm, they'll probably
> break. ?The best solution would be to just throw out the
> non-transposed scan table: there is zero benefit to having it at all
> and it just adds complexity and binary size.
>
> Jason
>

Here's an updated version with three patches.  It passes FATE with and
without asm.

First patch: eliminate the non-transposed scantable mode.  There's no
reason for this to even exist and other devs on IRC agree.  It just
adds complexity.  NB: This change allows some simplification, like
removing some entries in H264Context, but to keep this patch simple I
didn't do all possible simplifications.

Second patch: Mostly rewritten with new asm and stuff to handle an
obnoxious corner case.

Third patch: Use x264-style DC NNZ tracking to avoid calling
dequant_idct when there's no coefficients to deal with.  This could
also be used in the future to add dc_dequant_idct, which would handle
the very common case of there being only a DC luma/chroma DC
coefficient.  In short: in addition to what this does, there's extra
optimization opportunities added by this patch.

Jason
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stuff.diff
Type: text/x-patch
Size: 38790 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110113/22270bc6/attachment.bin>



More information about the ffmpeg-devel mailing list