[FFmpeg-devel] [PATCH] Split H.264 luma dc idct, implement MMX/SSE2 versions
Fri Jan 14 08:22:49 CET 2011
On Thu, Jan 13, 2011 at 11:12 PM, Luca Barbato <lu_zero at gentoo.org> wrote:
> On 1/14/11 3:29 AM, Jason Garrett-Glaser wrote:
>> On Wed, Jan 12, 2011 at 8:06 PM, Jason Garrett-Glaser<jason at x264.com>
>>> This patch splits the H.264 i16x16 luma-dc idct and implements it in
>>> asm on x86. ?It does this by storing the DC coefficients in a separate
>>> location initially, then scattering them at the end of the asm
>>> function. ?This lets us use SIMD on the inverse transform and dequant.
>>> The result is 1043 -> ?413 dezicycles spend in the inverse transform.
>>> 1. ?Don't do the idct_dc/dequant if there are no coefficients. ?In the
>>> current architecture we don't know this; we'd need to add an entry to
>>> scan8 (x264 does this) or move the idct-dc call into cabac/cavlc (I'm
>>> fine with this too). ?You'd still have to modify them in the latter
>>> case to, for example, return the number of coefficients.
>>> 2. ?THIS PROBABLY BREAKS ARM/PPC/SIMILAR because of an extremely
>>> stupid architectural problem in ffh264. ?That is, the scantables are
>>> transposed in the case of asm, but not in the case of C. ?So this
>>> means that if my idct_dc function isn't implemented in asm for all
>>> architectures that have idct implemented in asm, they'll probably
>>> break. ?The best solution would be to just throw out the
>>> non-transposed scan table: there is zero benefit to having it at all
>>> and it just adds complexity and binary size.
>> Here's an updated version with three patches. ?It passes FATE with and
>> without asm.
>> First patch: eliminate the non-transposed scantable mode. ?There's no
>> reason for this to even exist and other devs on IRC agree. ?It just
>> adds complexity. ?NB: This change allows some simplification, like
>> removing some entries in H264Context, but to keep this patch simple I
>> didn't do all possible simplifications.
>> Second patch: Mostly rewritten with new asm and stuff to handle an
>> obnoxious corner case.
>> Third patch: Use x264-style DC NNZ tracking to avoid calling
>> dequant_idct when there's no coefficients to deal with. ?This could
>> also be used in the future to add dc_dequant_idct, which would handle
>> the very common case of there being only a DC luma/chroma DC
>> coefficient. ?In short: in addition to what this does, there's extra
>> optimization opportunities added by this patch.
> From what I'm seeing it would be faster already with those patches, did you
> minibenchmark it already?
Yes it is faster, I just meant that this opens yet more opportunities.
More information about the ffmpeg-devel