[FFmpeg-devel] [RFC] optimize ff_emulated_edge_mc

Sun Jan 2 19:05:43 CET 2011

Hi,

On Thu, Dec 30, 2010 at 5:26 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Wed, Dec 29, 2010 at 10:03:04PM -0500, Ronald S. Bultje wrote:
>> On Wed, Dec 29, 2010 at 8:06 PM, Ronald S. Bultje <rsbultje at gmail.com> wrote:
>> > emu_edge_mc looks optimizable and shows up in my profilings. A simple
>> > loop->memcpy makes things a lot faster already (see attached):
>> [..]
>> > after
>> [..]
>> > 6165 dezicycles in ff_emulated_edge_mc, 1048040 runs, 536 skips
>> > 6115 dezicycles in ff_emulated_edge_mc, 1048044 runs, 532 skips
>> > 6087 dezicycles in ff_emulated_edge_mc, 1048158 runs, 418 skips
>> >
>> > before
>> [..]
>> > 9104 dezicycles in ff_emulated_edge_mc, 1047805 runs, 771 skips
>> > 9131 dezicycles in ff_emulated_edge_mc, 1047866 runs, 710 skips
>> > 9097 dezicycles in ff_emulated_edge_mc, 1047874 runs, 702 skips
>> [..]
>>
>> Another few more changes attached, doing memcpy() on top/bottom edge
>> brings it to 540 cycles:
>>
>> 5414 dezicycles in ff_emulated_edge_mc, 1048331 runs, 245 skips
>>
>> and then reordering the left/right edge loop a little brings it to 520:
>>
>> 5186 dezicycles in ff_emulated_edge_mc, 1048288 runs, 288 skips
>>
>> I'm too lazy to run this multiple times.
>>
>> For the left/right edge fills, I tried using memset(), but that slows
>> it down considerably, it appears it doesn't inline it. Jason said he
>> saw the same on some compilers with the memcpy() trick. Which makes me
>> think, maybe we can emulate the inline memset() trick with some more
>> elaborate C code? What I'm thinking is basically edge_val *=
>> 0x01010101U; while (to_write >= 4) write(edge_val); if (to_write&2)
>> write(edge_val); if (to_write & 1) write(edge_val); or so. Also, since
>> most time is spent in copying the blocks quite literally, the main
>> copy block could certainly use some optimizations, especially since
>> width is generally something like 16...
>>
>> Ronald
>
>>  dsputil.c |   22 ++++++++++------------
>>  1 file changed, 10 insertions(+), 12 deletions(-)
>> 6b5be1a69247178dd53af1f622a49750d231045d  emu_edge_mc.patch
>
> feel free to commit whatever makes ff_emulated_edge_mc() faster
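
For reference, the inline-memset idea quoted above could look roughly
like the sketch below. This is only an illustration: fill_edge() is a
made-up name, it is not part of the attached patch, and whether it
actually beats a non-inlined memset() would have to be measured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not in the patch): write n copies of val to dst. */
static void fill_edge(uint8_t *dst, uint8_t val, int n)
{
    /* Replicate the edge byte into all four bytes of a 32-bit word. */
    uint32_t v32 = val * 0x01010101U;

    assert(n >= 0);

    /* Fixed-size memcpy() calls are normally compiled to plain stores
     * and avoid unaligned-access problems of a pointer cast. */
    while (n >= 4) {
        memcpy(dst, &v32, 4);
        dst += 4;
        n   -= 4;
    }
    if (n & 2) {
        /* All bytes of v32 are equal, so endianness does not matter. */
        memcpy(dst, &v32, 2);
        dst += 2;
    }
    if (n & 1)
        *dst = val;
}
```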

Attached is a more reviewable version. It contains essentially the same
changes to the C version as above, plus I've added the function to
DSPContext and made all decoders use it. For VP8 it is now down from
over 1000 cycles (see above) to ~259 cycles, i.e. 4x as fast as the
original and about 2x as fast as the faster C variant in my original
post. All this on a Core i7 with the Elephants Dream sample, on a
MacBook Pro / OS X 10.6.

Here's what the asm version does differently from the C version:
- memcpy-style copy of top/bottom edge and body uses movdqu and then
only mov for the remaining 8/4/2/1 bytes
- left/right edge writing decision is made once, and then the loop is
largely branchless - this could perhaps be done for the C version also
- the left/right edges are written two bytes at a time (makes a little
bit of a difference; I tried 4/8 bytes also but that's slower,
probably because we then need to ensure we write the correct number of
bytes, whereas for 2, we can overwrite by one into the edge pixel
itself and then it doesn't matter). I like how you can mov %al, %ah
without destroying the lower 8 bits; it's unfortunate that that's not
possible for any other part of the general registers (or xmm/mmx
registers)...
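
To make the structure above easier to follow, here is a plain-C
restatement of the same steps. It is only a sketch under assumptions:
emu_edge_sketch() and its signature are made up for illustration (it
is not ff_emulated_edge_mc() and not the code in the patch), it
assumes the block overlaps the picture by at least one pixel, and it
uses memset() for the left/right fill purely for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy a bw x bh block whose top-left corner is at
 * (x0, y0) in a w x h picture into dst, replicating picture edge pixels
 * for the parts of the block that fall outside the picture. */
static void emu_edge_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                            const uint8_t *pic, ptrdiff_t pic_stride,
                            int w, int h, int x0, int y0, int bw, int bh)
{
    /* Decide once which block columns/rows lie inside the picture. */
    int start_x = x0 < 0 ? -x0 : 0;
    int end_x   = x0 + bw > w ? w - x0 : bw;
    int start_y = y0 < 0 ? -y0 : 0;
    int end_y   = y0 + bh > h ? h - y0 : bh;
    int y;

    assert(start_x < end_x && start_y < end_y);

    /* Body rows: copy the valid span, then extend it left and right. */
    for (y = start_y; y < end_y; y++) {
        const uint8_t *srow = pic + (ptrdiff_t)(y0 + y) * pic_stride;
        uint8_t       *drow = dst + (ptrdiff_t)y * dst_stride;

        memcpy(drow + start_x, srow + x0 + start_x,
               (size_t)(end_x - start_x));
        memset(drow, drow[start_x], (size_t)start_x);
        memset(drow + end_x, drow[end_x - 1], (size_t)(bw - end_x));
    }

    /* Top edge: replicate the first finished body row. */
    for (y = 0; y < start_y; y++)
        memcpy(dst + (ptrdiff_t)y * dst_stride,
               dst + (ptrdiff_t)start_y * dst_stride, (size_t)bw);

    /* Bottom edge: replicate the last finished body row. */
    for (y = end_y; y < bh; y++)
        memcpy(dst + (ptrdiff_t)y * dst_stride,
               dst + (ptrdiff_t)(end_y - 1) * dst_stride, (size_t)bw);
}
```

Copying the top/bottom rows from an already extended body row means
the left/right fill only has to run on rows that really exist in the
picture.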

This has no effect on H264 decoding unless CODEC_FLAG_EMU_EDGE is set,
btw - this is different for VP8 because in VP8, we don't use buffer
edges for MC, but for intra prediction. Maybe this can be [mf]ixed in
ffvp8dec at some point. For the H264 cathedral sample with -flags
emu_edge, it goes from 1310 cycles (1.0% of the complete ffmpeg process
according to my profiling) for the original code to 980 for my modified
C to 377 for my ASM version. Total decoding time goes from 16.426 sec
for the original C -> 16.370 for the modified C -> 16.315 sec for asm =
~0.68% saved (average of 4 runs each). So wrt H264 it's relevant for
VLC or other apps that use CODEC_FLAG_EMU_EDGE, e.g. for direct
rendering.

There's a use of ff_emulated_edge_mc() in gmc_mmx which is harder to
handle, since we don't have access to the function pointers in
DSPContext there. I created (for x86-64) a gmc_sse which calls this
function instead of the C version. gmc_mmx no longer exists on x86-64,
which is OK since every x86-64 CPU has SSE. The patch passes make fate,
but I didn't measure the effect on decoding time for each codec
specifically - too much work...

Ronald
-------------- next part --------------
A non-text attachment was scrubbed...
Name: emu_edge_mc.patch
Type: application/octet-stream
Size: 30781 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20110102/284cd7b6/attachment.obj>