[FFmpeg-devel] [PATCH][v2] mmx implementation of vc-1 inverse transformations

Yuriy Kaminskiy yumkam
Fri Jan 21 14:15:02 CET 2011


As it has lain unchanged for a few months already, it seems unlikely that I will
be able to work on it further, so I am posting the patches as is. Not all issues
are addressed, but nothing more will be improved by me, sorry.
-- 

Ronald S. Bultje wrote:
> Hi,
> 
> as a start: I agree with Jason, I'd prefer new code like this to be
> written in yasm, but I won't object since I didn't write it.

As I said: this is not my code, and I do not know the assembler (and the algorithms)
well enough to rewrite it; I am only reposting it with the minimal necessary changes.

> 2010/9/30 Yuriy Kaminskiy <yumkam at mail.ru>:
> [..]
>> Index: libavcodec/vc1dec.c
>> @@ -2086,7 +2086,7 @@ static int vc1_decode_p_block(VC1Context
>>          for(j = 0; j < 2; j++) {
>>              last = subblkpat & (1 << (1 - j));
>>              i = 0;
>> -            off = j * 32;
>> +            off = s->dsp.vc1_inv_trans_8x4_transposed ? j * 4 : j * 32;
>>              while (!last) {
>>                  vc1_decode_ac_coeff(v, &last, &skip, &value, v->codingset2);
>>                  i += skip;
> 
> I would prefer if we wouldn't add random fields in DSPContext, this
> will quickly go crazy. Better check (like h264 does) whether the
> function is the C function (which then needs to be exported in a
> header) and do this behaviour based on that.

That won't work: the PPC asm uses the untransposed order too, and there are *two*
C variants. I also don't like comparing function pointers.
So I kept this as is.
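
For reference, a minimal sketch of the flag approach as kept in the patch (the
field name is the one from the hunk above; its exact type, the init-code
placement and the function name vc1_inv_trans_8x4_mmx are only illustrative):

    /* DSPContext gains one field next to the function pointer: */
    int vc1_inv_trans_8x4_transposed; /* coefficients expected in transposed order */

    /* per-arch init code sets both together: */
    c->vc1_inv_trans_8x4            = vc1_inv_trans_8x4_mmx;
    c->vc1_inv_trans_8x4_transposed = 1;

    /* vc1dec.c then only tests the flag: */
    off = s->dsp.vc1_inv_trans_8x4_transposed ? j * 4 : j * 32;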

>> -#define LOAD4(stride,in,a,b,c,d)\
>> -    "movq 0*"#stride"+"#in", "#a"\n\t"\
>> -    "movq 1*"#stride"+"#in", "#b"\n\t"\
>> -    "movq 2*"#stride"+"#in", "#c"\n\t"\
>> -    "movq 3*"#stride"+"#in", "#d"\n\t"
>> +#define LOAD4(m,stride,in,a,b,c,d)\
>> +    "mov"#m" 0*"#stride"+"#in", "#a"\n\t"\
>> +    "mov"#m" 1*"#stride"+"#in", "#b"\n\t"\
>> +    "mov"#m" 2*"#stride"+"#in", "#c"\n\t"\
>> +    "mov"#m" 3*"#stride"+"#in", "#d"\n\t"
> 
> This is one of those cases where, yes, it saves space but really,
> you'll have to change so much code and it doesn't quite become more
> readable, so I wouldn't mind if you'd just write out the SSE macros as
> separate macros. That is, if you want this kept in dsputil_mmx.h.
> 
> Should VC1-specific macros go in dsputil_mmx.h or in a vc1-specific
> file (e.g. in the file where it's used)? LOAD4/STORE4 are unused. In
> that case, I don't mind the above macro since it's not in a
> ubiquitously-included header.

*Now* unused. When this patch was first submitted it was also used in the h264
inline asm. Moved to vc1dsp_simd.h.
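
To illustrate why the extra "m" parameter is there: the same macro instantiates
both the MMX and the SSE2 loads (the operands below are illustrative, not copied
from the patch):

    /* MMX: m = q selects movq, 8 bytes (half a row of 16-bit coefficients) */
    LOAD4(q,   0x10, 0x00(%0), %%mm0,  %%mm1,  %%mm2,  %%mm3)
    /* SSE2: m = dqa selects movdqa, a full 16-byte aligned row */
    LOAD4(dqa, 0x10, 0x00(%0), %%xmm0, %%xmm1, %%xmm2, %%xmm3)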

>> +/*
>> +    precodition:
> 
> Precondition. Please run a spell-checker, I don't intend to nit too
> much but this sort of stuff you can easily pick up yourself.
fixed.

>> +#define OPC_SS_AB(opc, src, dst0, dst1)\
>> +    #opc" "#src","#dst0"\n\t"\
>> +    #opc" "#src","#dst1"\n\t"
>> +
>> +#define OPC_SSSS_ABCD(opc, src, dst0, dst1, dst2, dst3)\
>> +    OPC_SS_AB(opc, src, dst0, dst1)\
>> +    OPC_SS_AB(opc, src, dst2, dst3)
> 
> I think this borders obfuscation. It doesn't really hurt in any way
> but it does hide the actual code being done up to an abstraction
> level, and really at no gain. There is very little chance that without
> this macro, we write slower code. I'd prefer this removed and the
> resulting code written out fully, it's not that much longer.

Not my code -> I don't understand it well enough -> no massive changes, sorry.
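
For what it's worth, the macro only factors out repeating one opcode over
several destinations; for example

    OPC_SS_AB(psraw, $1, %%mm5, %%mm7)

expands to

    "psraw $1,%%mm5\n\t"
    "psraw $1,%%mm7\n\t"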

>> +#define TRANSFORM_4X4_COMMON(m,r0,r1,r2,r3,r4,r5,r6,r7,c)\
> 
>> +    "mov"#m"  "#c", "#r5"\n\t" /* c */\
>> +    "mov"#m" "#r5", "#r7"\n\t" /* c */\
>> +    SUMSUB_BA(r2,r0)\
>> +    "paddw   "#r0", "#r5"\n\t" /* r0 - r2 + c */\
> 
> Note the dependency chain for r5; put the second mov#m below the
> sumsub_ba, should be faster.

IIRC this question was raised in the 2008 round already. There was no big gain from
reordering, and it obfuscates the code even more. So I kept it as is and added a
separate reordering patch (I hope I did not break anything :-|). *If* testing on the
affected CPUs shows a noticeable difference - fine, squash it with 30-*/40-* resp.
Otherwise it is better left as it was. On my CPU it changes only microbenchmarks
(and I don't trust microbenchmarks; they tend to change in unpredictable ways
with unrelated code changes [due to alignment/caching?]).

>> +    "paddw   "#r2", "#r7"\n\t" /* r0 + r2 + c */\
>> +    OPC_SS_AB(psraw,$1,r5,r7)\
>> +    "mov"#m" "#r1", "#r4"\n\t" /* r1 */\
>> +    "mov"#m" "#r3", "#r6"\n\t" /* r3 */\
> 
>> +    "paddw   "#r1", "#r1"\n\t" /* 2 * r1 */\
>> +    "paddw   "#r6", "#r1"\n\t" /* 2 * r1 + r3 */\
>> +    "paddw   "#r3", "#r3"\n\t" /* 2 * r3 */\
>> +    "psubw   "#r4", "#r3"\n\t" /* 2 * r3 - r1 */\
> 
> These two can be interleaved.
ok (in separate patch)
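
i.e. roughly:

    "paddw   "#r1", "#r1"\n\t" /* 2 * r1 */\
    "paddw   "#r3", "#r3"\n\t" /* 2 * r3 */\
    "paddw   "#r6", "#r1"\n\t" /* 2 * r1 + r3 */\
    "psubw   "#r4", "#r3"\n\t" /* 2 * r3 - r1 */\

(r4 and r6 already hold copies of the original r1 and r3, so the two chains are
independent and can be interleaved this way.)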

>> +    "paddw   "#r1", "#r4"\n\t" /* 3 * r1 + r3 */\
>> +    "paddw   "#r3", "#r6"\n\t" /* 3 * r3 - r1 */\
> 
>> +    SUMSUB_BA(r4,r7)\
>> +    SUMSUB_BA(r6,r5)\
> 
> These also. Feel free to add a SUMSUB_BAx2 macro. (This is
> particularly relevant for OOE CPUs, so you may not see any performance
> change for this yourself depending on your CPU).

Added a SUMSUB_BADC macro (it is also used in cavsdsp_mmx.c [totally *UNTESTED*]).
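
A minimal sketch of what such a SUMSUB_BADC looks like, assuming the SUMSUB_BA
semantics implied by the comments above (a = a + b, b = b - a) and simply
interleaving the two independent pairs (the macro in the 25-* patch may differ
in details):

    #define SUMSUB_BADC(a,b,c,d)\
        "paddw "#b", "#a"\n\t" /* a = a + b */\
        "paddw "#d", "#c"\n\t" /* c = c + d */\
        "paddw "#b", "#b"\n\t" /* b = 2 * b */\
        "paddw "#d", "#d"\n\t" /* d = 2 * d */\
        "psubw "#a", "#b"\n\t" /* b = b - a (original values) */\
        "psubw "#c", "#d"\n\t" /* d = d - c (original values) */

    /* so SUMSUB_BA(r4,r7) SUMSUB_BA(r6,r5) becomes SUMSUB_BADC(r4,r7,r6,r5) */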

>> +    OPC_SSSS_ABCD(psraw,$2,r4,r7,r5,r6)\
>> +    "paddw   "#r1", "#r4"\n\t" /* ((((r0 + r2 + c) >> 1) + (3 * r1 + r3)) >> 2) + (2 * r1 + r3) */\
>> +    "psubw   "#r3", "#r5"\n\t" /* ((((r0 - r2 + c) >> 1) - (3 * r3 - r1)) >> 2) - (2 * r3 - r1) */\
>> +    "paddw   "#r3", "#r6"\n\t" /* ((((r0 - r2 + c) >> 1) + (3 * r3 - r1)) >> 2) + (2 * r3 - r1) */\
>> +    "psubw   "#r1", "#r7"\n\t" /* ((((r0 + r2 + c) >> 1) - (3 * r1 + r3)) >> 2) - (2 * r1 + r3) */
> [..]
>> +#define TRANSFORM_8X4_ROW_COMMON(m,r0,r1,r2,r4,r5,r6,r7)\
>> +    "mov"#m" "#r2", "#r4"\n\t" /* r6 */\
>> +    "paddw   "#r4", "#r4"\n\t" /* 2 * r6 */\
>> +    "paddw   "#r4", "#r2"\n\t" /* 3 * r6 */\
>> +    "mov"#m" "#r0", "#r6"\n\t" /* r0 */\
>> +    "paddw   "#r1", "#r6"\n\t" /* r0 + r1 */\
> 
> Again, the bottom two are independent from the other three: interleave.

ok (in separate patch)

>> +    "mov"#m" "#r1", "#r4"\n\t" /* r1 */\
>> +    "psubw   "#r0", "#r4"\n\t" /* r1 - r0 */\
>> +    "paddw   "#r6", "#r0"\n\t" /* 2 * r0 + r1 */\
>> +    "paddw   "#r4", "#r1"\n\t" /* 2 * r1 - r0 */\
> 
>> +    "pxor    "#r5", "#r5"\n\t" /* 0 */\
>> +    "mov"#m" "#r5", "#r7"\n\t" /* 0 */\
> 
> I'm quite sure pxor m7, m7 would be faster since then there's no
> dependency chain.

ok.
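
i.e. two independent pxors instead of the pxor + mov pair:

    "pxor    "#r5", "#r5"\n\t" /* 0 */\
    "pxor    "#r7", "#r7"\n\t" /* 0 */\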

>> +    "psubw   "#r6", "#r5"\n\t" /* -(r1 + r0) */\
>> +    "psubw   "#r4", "#r7"\n\t" /* -(r1 - r0) */\
>> +    OPC_SSSS_ABCD(psraw,$1,r4,r5,r6,r7)
> [..]
> 
>> +#define TRANSFORM_4X8_COL_COMMON(m,r0,r1,r2,r4,r5,r6,r7,t,c)\
>> +    "pcmpeqw  "#t",  "#t"\n\t" /* -1 */\
>> +    "mov"#m" "#r2", "#r4"\n\t" /* r2 */\
>> +    "paddw   "#r4", "#r4"\n\t" /* 2 * r2 */\
>> +    "paddw   "#r4", "#r2"\n\t" /* 3 * r2 */\
>> +    "mov"#m" "#r0", "#r6"\n\t" /* r0 */\
>> +    "paddw   "#r1", "#r6"\n\t" /* r0 + r1 */\
> 
> Interleave (as per above).

ok (in separate patch)

>> +    "mov"#m" "#r1", "#r4"\n\t" /* r1 */\
>> +    "psubw   "#r0", "#r4"\n\t" /* r1 - r0 */\
>> +    "paddw   "#r6", "#r0"\n\t" /* 2 * r0 + r1 */\
>> +    "paddw   "#r4", "#r1"\n\t" /* 2 * r1 - r0 */\
>> +    "mov"#m"  "#c", "#r5"\n\t" /* c */\
>> +    "mov"#m" "#r5", "#r7"\n\t" /* c */\
>> +    SUMSUB_BA(r6,r5)\
>> +    SUMSUB_BA(r4,r7)\
> 
> SUMSUB_BAx2

ok

>> +    OPC_SS_AB(psubw,t,r6,r7)\
>> +    OPC_SSSS_ABCD(psraw,$1,r4,r5,r6,r7)
> [..]
> 
>> +#define LOAD_ADD_CLAMP_STORE1(m,io,t,z,r)\
>> +    "mov"#m"   "#io",  "#t"\n\t"\
>> +    "punpcklbw  "#z",  "#t"\n\t"\
>> +    "paddw      "#t",  "#r"\n\t"\
>> +    "packuswb   "#z",  "#r"\n\t"\
>> +    "mov"#m"    "#r", "#io"\n\t"
> 
> I would swear I thought we had an inline STORE_DIFF macro somewhere.
> Please at least rename this to STORE_DIFF. Maybe this a duplicate of
> ff_put_pixels_clamped_mmx() (if you simply "call" it, then gcc will
> inline it, i.e. no performance penalty but smaller source).

It is not the same operation as yasm's STORE_DIFF, so renaming it would only add
to the confusion.
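
For clarity, a scalar sketch of what one LOAD_ADD_CLAMP_STORE1 invocation does
per row (n is 4 pixels for the MMX variant, 8 for SSE2; the names dst/block/n
are illustrative only):

    /* load the destination pixels, widen to 16 bit, add the residual,
       clamp to 0..255 while repacking, store back */
    for (i = 0; i < n; i++)
        dst[i] = av_clip_uint8(dst[i] + block[i]);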

>> +#define LOAD_ADD_CLAMP_STORE4(m,stride,io,t,z,r0,r1,r2,r3)\
>> +    LOAD_ADD_CLAMP_STORE1(m,(io),t,z,r0)\
>> +    "add "#stride", "#io"\n\t"\
>> +    LOAD_ADD_CLAMP_STORE1(m,(io),t,z,r1)\
>> +    "add "#stride", "#io"\n\t"\
>> +    LOAD_ADD_CLAMP_STORE1(m,(io),t,z,r2)\
>> +    "add "#stride", "#io"\n\t"\
>> +    LOAD_ADD_CLAMP_STORE1(m,(io),t,z,r3)
> 
> If somehow the above doesn't work, then at least please use a
> STORE_DIFFx2, as per x86util.asm (yes, that's yasm, but the logic is
> the same). Again that's interleaving, and also it saves you 2 of 3 add
> stride instructions.
Ummm... I tried to do something, but I am not sure about the results. It is in a
separate patch to ease benchmarking.

> Please rename the file to vc1dsp_mmx.h, there's nothing i386-generic
> about these functions and it's in line with the rest (e.g.
> dsputil_mmx.h).

It is a common file for the MMX and SSE2 implementations, so the unspecific name
has some justification (IIRC this was explained by the original author in the 2008
discussion of this patch).

>> +static void vc1_inv_trans_8x8_mmx(DCTELEM block[64])
>> +{
>> +    DECLARE_ALIGNED(16, int16_t, temp)[64];
>> +    __asm__ volatile(
>> +    LOAD4(q,0x10,0x00(%0),%%mm5,%%mm1,%%mm0,%%mm3)
> 
> Indenting, this becomes 2xsemi-unreadable.
Indenting would make some lines too long. Anyway, reindented.

>> +    TRANSPOSE4(%%mm5,%%mm1,%%mm0,%%mm3,%%mm4)
>> +    STORE4(q,0x10,0x00(%0),%%mm5,%%mm3,%%mm4,%%mm0)
>> +
>> +    LOAD4(q,0x10,0x08(%0),%%mm6,%%mm5,%%mm7,%%mm1)
>> +    TRANSPOSE4(%%mm6,%%mm5,%%mm7,%%mm1,%%mm2)
>> +    STORE4(q,0x10,0x08(%0),%%mm6,%%mm1,%%mm2,%%mm7)
> 
>> +    TRANSFORM_8X4_ROW_H1
>> +    (
>> +        q,q,
>> +        0x00(%0),0x20(%0),0x08(%0),0x38(%0),
>> +        %%mm0,%%mm1,%%mm2,%%mm3,%%mm4,%%mm5,%%mm6,%%mm7
>> +    )
> 
> This is even worse, this is absolutely unreadable. :-). Same for all
> other functions in this file.
> 
> This function does the same operation 4x, can you create a loop

No, only 2x, and I would not trust gcc to unroll it well. The structure is just:

TRANSFORM_8X4_ROW_H1 [...+0x00...]
TRANSFORM_8X4_ROW_H2

TRANSFORM_8X4_ROW_H1 [...+0x40...]
TRANSFORM_8X4_ROW_H2

TRANSFORM_4X8_COL_H1 [...+0x08...]
TRANSFORM_4X8_COL_H2

TRANSFORM_4X8_COL_H1 [...+0x00...]
TRANSFORM_4X8_COL_H2

> without slowing it down significantly? (gcc should unroll by itself,
> but then source size is 4x smaller).
> 
> Same comments for the SSE2 one. Please fix indenting, do the proper
> thing for SSE2 vs. SSE2SLOW if you have a Core1 or similar so we can
> test if SSE2_SLOW is actually faster than MMX2.

Sorry, I have no "SSE2-fast" CPU here, so I cannot test that.
And while *x264* thinks my processor is SSE2slow (= *AMD* before SSE4a),
*ffmpeg* thinks it is *SSE2-without-slow*.
CPUs that are marked as SSE2slow in ffmpeg (*Intel* Pentium M etc.) are marked as
"has no SSE2&3 at all" in x264.

> For SSE2, please mark xmm6 and xmm7 as clobbered (see
> doc/optimization.txt), otherwise it breaks on Win64.

fixed.
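
For reference, a sketch of how the clobbers end up being declared (plain GCC
clobber-list form; doc/optimization.txt describes the portable way to do it, and
the exact register set depends on the function):

    __asm__ volatile(
        /* ... SSE2 transform ... */
        :: "r"(block)
        : "memory", "%xmm6", "%xmm7" /* xmm6/xmm7 are callee-saved on Win64 */
    );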

> The getenv hack shouldn't be needed.

It was only added for easier benchmarking by someone with a different SSE2 unit;
as the name suggests, it is not intended for SVN inclusion.

> Once you fix all the above, let's do a second round of review.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 00_vc1dsp-5.patch
Type: text/x-diff
Size: 8791 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/e8dc8c16/attachment.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 20_dsputil_mmx-3.patch
Type: text/x-diff
Size: 2585 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/e8dc8c16/attachment-0001.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 25_dsputil_mmx-sumsub2.patch
Type: text/x-diff
Size: 848 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/e8dc8c16/attachment-0002.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 26_dsputil_mmx-sumsub2-cavsdsp.patch
Type: text/x-diff
Size: 1331 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/e8dc8c16/attachment-0003.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 30_vc1dsp_mmx-7.patch
Type: text/x-diff
Size: 33535 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/e8dc8c16/attachment-0004.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 35_vc1dsp_mmx-reorder.patch
Type: text/x-diff
Size: 7981 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/e8dc8c16/attachment-0005.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 40_vc1dsp_sse2-6.patch
Type: text/x-diff
Size: 10411 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/e8dc8c16/attachment-0006.patch>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 45_vc1dsp_mmx-reorder-load-n-clamp.patch
Type: text/x-diff
Size: 7384 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110121/e8dc8c16/attachment-0007.patch>


