[FFmpeg-devel] [PATCH] Fix function parameters for rgb48 to YV12 functions.

Wed Feb 3 02:30:24 CET 2010

Hi,

On Tue, Feb 2, 2010 at 5:42 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Tue, Feb 02, 2010 at 08:21:15PM +0100, Reimar Döffinger wrote:
>> On Tue, Feb 02, 2010 at 08:01:26PM +0100, Michael Niedermayer wrote:
>> > On Tue, Feb 02, 2010 at 04:10:06PM -0200, Ramiro Polla wrote:
>> > > Hello Michael,
>> > >
>> > > On Sun, Jan 24, 2010 at 8:31 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > > > the gain happens when you change the variables used to calculate the index
>> > > > also to it. You could also try to make the index unsigned but make sure it
>> > > > cant be negative if you try this
>> > >
>> > > Sorry but I still don't understand how that will be of use here in
>> > > libswscale. I've tried forcing int32_t and int64_t for x86_64 in some
>> > > of those functions (some xxxTo(Y|UV), hScale and the fast bilinear
>> > > ones), in all C, MMX and MMX2. All I can see is the expansion from
>> > > 32-bit to 64-bit being changed from caller and callee. There is no
>> > > difference in the inner loop, nor in how gcc addresses the src and
>> > > dst arrays.
>> >
>> > maybe theres no gain for swscale, i cant say without looking at the asm
>> > gcc generates.
>> > i know that in h264 gcc filled some functions with 32->64 sign extension
>> > code in the inner loops.
>>
>> Which compilation options have you been using?
>
> default of ffmpeg & gcc-4.4
> also a quick
> grep movslq libavcodec/h264_loopfilter.S | grep -v '(' |wc
>      68     136    1220
>
> and with -mtune=core2 -march=core2 -mcpu=core2
> grep movslq libavcodec/h264_loopfilter.S | grep -v '(' |wc
>      68     136    1226
>
> so no, its not helping it still does produce all the register-register
> sign extensions

Hmm, I think I understand now what you mean... This is what the asm of
some functions looks like when things get changed from long to int.
I'll list the sizes of some functions as <name> <size with long>
<size with int> <int - long>, along with their differences (mostly
only in the prologues). All tested with gcc 4.4.1 from Ubuntu 9.10:

nv12ToUV_MMX    77  87  10
BEToUV_MMX      84  90  6
and similar _MMX functions.
int:
    lea    (%r8,%r8,1),%r9d
    movslq %r8d,%rax
    neg    %r8d
    movslq %r8d,%r8
    add    %rax,%rdi
    add    %rax,%rsi
    movslq %r9d,%r9
    add    %r9,%rdx
    add    %r9,%rcx
    movq   0x0,%mm4
long:
    lea    (%r8,%r8,1),%rax
    mov    %r8,%r9
    add    %r8,%rdi
    neg    %r9
    add    %r8,%rsi
    add    %rax,%rdx
    add    %rax,%rcx
    movq   0x0,%mm4
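
To make the comparison above concrete, here is a minimal C sketch of the
pattern those prologues come from. It is not the libswscale source: the
function names and the plain C loop are assumptions for illustration. Both
variants move the pointers to the end of the row and index with a negative
counter; the only difference is the type of width, which decides whether
the compiler has to widen it before the pointer additions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an nv12ToUV-style routine: split interleaved
 * UVUV... bytes into separate U and V rows. Not the actual swscale code. */
void deinterleave_int(uint8_t *dstU, uint8_t *dstV,
                      const uint8_t *src, int width)
{
    assert(width >= 0);
    /* width is 32-bit, so each value added to a 64-bit pointer
     * has to be sign-extended first. */
    src  += 2 * width;
    dstU += width;
    dstV += width;
    for (int i = -width; i < 0; i++) {
        dstU[i] = src[2 * i];
        dstV[i] = src[2 * i + 1];
    }
}

/* Same routine with a long width: on x86_64 Linux long is already
 * pointer-sized, so no widening is needed before the additions. */
void deinterleave_long(uint8_t *dstU, uint8_t *dstV,
                       const uint8_t *src, long width)
{
    assert(width >= 0);
    src  += 2 * width;
    dstU += width;
    dstV += width;
    for (long i = -width; i < 0; i++) {
        dstU[i] = src[2 * i];
        dstV[i] = src[2 * i + 1];
    }
}
```

With int, the values width, -width and 2*width each need widening to 64
bits before they can be combined with a pointer, which would account for
the three movslq in the int prologue above.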

bgr24ToUV_half_3DNow    142 172 30
int:
    test   %r8d,%r8d
    push   %rbx
    jle    1cd8a <bgr24ToUV_half_3DNow+0xaa>
    sub    $0x1,%r8d
    xor    %eax,%eax
    lea    0x3(%r8,%r8,2),%rbx
    add    %rbx,%rbx
    movzbl (%rdx,%rax,1),%ecx
    movzbl 0x3(%rdx,%rax,1),%r9d
    movzbl 0x4(%rdx,%rax,1),%r10d
    movzbl 0x5(%rdx,%rax,1),%r8d
long:
    test   %r8,%r8
    jle    1cb1c <bgr24ToUV_half_3DNow+0x8c>
    lea    (%rdi,%r8,1),%r8
    movzbl (%rdx),%eax
    movzbl 0x3(%rdx),%r9d
    movzbl 0x4(%rdx),%r10d
    movzbl 0x5(%rdx),%ecx

rgb32ToUV   139 143 4
int:
    sub    $0x1,%r8d
    lea    0x4(%rdx,%r8,4),%r11
    mov    (%rdx),%ecx
long:
    push   %rbx
    shl    $0x2,%r8
    xor    %ecx,%ecx
    mov    (%rdx,%rcx,1),%r9d

All hyscale_fast functions have only one more movslq in the int version.

Then many have this difference where the int version uses sub and lea
while the long version uses either add %reg,%reg or shl $2, %reg.

abgrToA         37  45  8
BEToUV_C        51  51  0
nv12ToUV_C      48  48  0
int:
    sub    $0x1,%r8d
    lea    0x2(%r8,%r8,1),%r8
long:
    add    %r8,%r8
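
Reading the two sequences: the int version decrements the 32-bit width and
then forms 2 + 2*(width-1) in a 64-bit lea, while the long version simply
doubles width. For a positive width both give 2*width. A small C model of
that arithmetic follows; the function names are made up for illustration,
and it assumes the usual x86-64 behaviour that a write to %r8d clears the
upper half of %r8.

```c
#include <assert.h>
#include <stdint.h>

/* Models: sub $0x1,%r8d ; lea 0x2(%r8,%r8,1),%r8 */
uint64_t end_offset_int(int32_t width)
{
    assert(width > 0);                   /* the sketch only covers positive widths */
    uint32_t w32 = (uint32_t)width - 1u; /* 32-bit subtract */
    uint64_t r8  = w32;                  /* upper 32 bits are zero */
    return 2u + r8 + r8;                 /* 2 + 2*(width-1) */
}

/* Models: add %r8,%r8 */
uint64_t end_offset_long(int64_t width)
{
    return (uint64_t)width + (uint64_t)width;
}
```

So the extra sub is a way of getting a zero-extended 64-bit value out of
the 32-bit width without a separate movslq, at the cost of a longer lea.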

rgb15ToUV       141 143 2
long uses rbx (it pushes and pops rbx) while the int version doesn't.
In the inner loop, long accesses arrays with movzwl (%rdx,%rax,1),%r9d
instead of movzwl (%rdx),%ecx (I don't know what difference this
makes). long uses add %r8,%r8 instead of sub & lea.

Then there's:
rgb15ToUV_half  166 174 8
int:
    sub    $0x1,%r8d
    lea    0x4(%rdx,%r8,4),%r10
long:
    lea    (%rdx,%r8,4),%r10

Very few functions are larger with long, such as:
rgb15ToY        95  84  -11
int uses sub & lea instead of add. long uses more 64-bit registers so
the instructions are larger.

And on to the caller,

swScale_C       10319   10082   -237
long has 9 more movslq, uses more stack

I haven't checked all functions though.

The final size (with runtime cpudetect):
841336 swscale_ints.o
841024 swscale_longs.o

The number of movslq between registers:
$ objdump -d swscale_ints.o | grep movslq | grep -v "(" | wc -l
1038
$ objdump -d swscale_longs.o | grep movslq | grep -v "(" | wc -l
927

No speed differences were ever noticed. Dark_Shikari tells me a movslq
between registers is 1 uop...

As for other architectures, the ARM and PPC machines I have would have
made no difference since they're not 64-bit.

I've attached a patch which adds an array_index type, if that's what
you had in mind.
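
For readers without the attachment, here is a rough sketch of what an
array_index type could look like. This is an assumption for illustration
and not the contents of use_array_index.diff: it simply picks ptrdiff_t,
which is pointer-sized on 64-bit Linux and also on 64-bit Windows, where
long stays 32-bit. copy_row is a hypothetical helper showing the type in
use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical definition; the attached patch may define it differently. */
typedef ptrdiff_t array_index;

/* Hypothetical helper: the index has the same width as the pointers it
 * offsets, so no widening is needed inside the loop. */
void copy_row(uint8_t *dst, const uint8_t *src, array_index width)
{
    assert(width >= 0);
    for (array_index i = 0; i < width; i++)
        dst[i] = src[i];
}
```

The point of a dedicated typedef is that the choice of underlying type
lives in one place and can be changed per platform.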

Otherwise I really don't know what to do. long is being misused here,
and it breaks compilation on mingw-w64.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: use_array_index.diff
Type: application/octet-stream
Size: 18604 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100202/bb50c1cd/attachment.obj>