[FFmpeg-devel] [PATCH] Fix function parameters for rgb48 to YV12 functions.
Ramiro Polla
ramiro.polla
Wed Feb 3 02:30:24 CET 2010
Hi,
On Tue, Feb 2, 2010 at 5:42 PM, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Tue, Feb 02, 2010 at 08:21:15PM +0100, Reimar Döffinger wrote:
>> On Tue, Feb 02, 2010 at 08:01:26PM +0100, Michael Niedermayer wrote:
>> > On Tue, Feb 02, 2010 at 04:10:06PM -0200, Ramiro Polla wrote:
>> > > Hello Michael,
>> > >
>> > > On Sun, Jan 24, 2010 at 8:31 AM, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > > > the gain happens when you change the variables used to calculate the index
>> > > > also to it. You could also try to make the index unsigned but make sure it
>> > > > cant be negative if you try this
>> > >
>> > > Sorry but I still don't understand how that will be of use here in
>> > > libswscale. I've tried forcing int32_t and int64_t for x86_64 in some
>> > > of those functions (some xxxTo(Y|UV), hScale and the fast bilinear
>> > > ones), in all C, MMX and MMX2. All I can see is the expansion from
>> > > 32-bit to 64-bit being changed from caller and callee. There is no
>> > > difference in the inner loop, nor in how gcc addresses the src and
>> > > dst arrays.
>> >
>> > maybe theres no gain for swscale, i cant say without looking at the asm
>> > gcc generates.
>> > i know that in h264 gcc filled some functions with 32->64 sign extension
>> > code in the inner loops.
>>
>> Which compilation options have you been using?
>
> default of ffmpeg & gcc-4.4
> also a quick
> grep movslq libavcodec/h264_loopfilter.S | grep -v '('
>      68     136    1220
>
> and with -mtune=core2 -march=core2 -mcpu=core2
> grep movslq libavcodec/h264_loopfilter.S | grep -v '(' |wc
>      68     136    1226
>
> so no, its not helping it still does produce all the register-register
> sign extensions
Hmm, I think I understand now what you mean... This is what the asm of
some functions looks like when things get changed from long to int.
I'll put the sizes of some functions as in <name> <size with long>
<size with int> <int - long>, along with their differences (mostly
only prologues). All tested with gcc 4.4.1 from ubuntu 9.10:
nv12ToUV_MMX 77 87 10
BEToUV_MMX 84 90 6
and similar _MMX functions.
int:
lea (%r8,%r8,1),%r9d
movslq %r8d,%rax
neg %r8d
movslq %r8d,%r8
add %rax,%rdi
add %rax,%rsi
movslq %r9d,%r9
add %r9,%rdx
add %r9,%rcx
movq 0x0,%mm4
long:
lea (%r8,%r8,1),%rax
mov %r8,%r9
add %r8,%rdi
neg %r9
add %r8,%rsi
add %rax,%rdx
add %rax,%rcx
movq 0x0,%mm4
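For reference, the prologue difference above corresponds to a pattern roughly like this C sketch. It assumes a negative index that counts up to zero; the function names and the use of ptrdiff_t are made up for illustration and are not the swscale code. With an int width, the pointer offsets and the negated start index may each need a 32->64 sign extension before they can be used in an address, which is where the movslq instructions in the int listing come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the MMX wrappers: split interleaved UV bytes.
 * width is an int, so on x86_64 the compiler may have to sign-extend it
 * (and -width, and 2*width) before adding to the 64-bit pointers. */
static void split_uv_int(uint8_t *dst_u, uint8_t *dst_v,
                         const uint8_t *src, int width)
{
    const uint8_t *s = src + 2 * width;
    uint8_t *u = dst_u + width;
    uint8_t *v = dst_v + width;
    int i;

    for (i = -width; i < 0; i++) {
        u[i] = s[2 * i];
        v[i] = s[2 * i + 1];
    }
}

/* Same loop with a pointer-sized signed index (an assumption, not the
 * type used in swscale): the values are already 64-bit on x86_64. */
static void split_uv_ptrdiff(uint8_t *dst_u, uint8_t *dst_v,
                             const uint8_t *src, ptrdiff_t width)
{
    const uint8_t *s = src + 2 * width;
    uint8_t *u = dst_u + width;
    uint8_t *v = dst_v + width;
    ptrdiff_t i;

    for (i = -width; i < 0; i++) {
        u[i] = s[2 * i];
        v[i] = s[2 * i + 1];
    }
}

/* Both variants must produce identical output; only the setup differs. */
static void split_uv_selfcheck(void)
{
    static const uint8_t src[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint8_t u0[4], v0[4], u1[4], v1[4];
    int i;

    split_uv_int(u0, v0, src, 4);
    split_uv_ptrdiff(u1, v1, src, 4);
    for (i = 0; i < 4; i++) {
        assert(u0[i] == u1[i] && u0[i] == src[2 * i]);
        assert(v0[i] == v1[i] && v0[i] == src[2 * i + 1]);
    }
}
```

Compiling both with gcc -O2 -S and comparing the prologues should show whether the extra movslq appears for a given compiler version.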
bgr24ToUV_half_3DNow 142 172 30
int:
test %r8d,%r8d
push %rbx
jle 1cd8a <bgr24ToUV_half_3DNow+0xaa>
sub $0x1,%r8d
xor %eax,%eax
lea 0x3(%r8,%r8,2),%rbx
add %rbx,%rbx
movzbl (%rdx,%rax,1),%ecx
movzbl 0x3(%rdx,%rax,1),%r9d
movzbl 0x4(%rdx,%rax,1),%r10d
movzbl 0x5(%rdx,%rax,1),%r8d
long:
test %r8,%r8
jle 1cb1c <bgr24ToUV_half_3DNow+0x8c>
lea (%rdi,%r8,1),%r8
movzbl (%rdx),%eax
movzbl 0x3(%rdx),%r9d
movzbl 0x4(%rdx),%r10d
movzbl 0x5(%rdx),%ecx
rgb32ToUV 139 143 4
int
sub $0x1,%r8d
lea 0x4(%rdx,%r8,4),%r11
mov (%rdx),%ecx
long
push %rbx
shl $0x2,%r8
xor %ecx,%ecx
mov (%rdx,%rcx,1),%r9d
All hyscale_fast functions have only one more movslq in the int version.
Then many have this difference where the int version uses sub and lea
while the long version uses either add %reg,%reg or shl $2, %reg.
abgrToA 37 45 8
BEToUV_C 51 51 0
nv12ToUV_C 48 48 0
int
sub $0x1,%r8d
lea 0x2(%r8,%r8,1),%r8
long
add %r8,%r8
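My reading of the sub & lea versus add difference, written out in C. This is an interpretation of the two sequences, not something taken from gcc's documentation, and the function names are made up. A 32-bit sub zero-extends its result into the full 64-bit register, so for width > 0 the int version can get a 64-bit value without a movslq, at the cost of the extra lea.

```c
#include <assert.h>
#include <stdint.h>

/* What "sub $0x1,%r8d ; lea 0x2(%r8,%r8,1),%r8" computes: the 32-bit sub
 * leaves (width - 1) zero-extended in the 64-bit register, then lea
 * doubles it and adds 2. Only meaningful for width > 0. */
static uint64_t span_from_int(int width)
{
    uint64_t last = (uint32_t)(width - 1);  /* 32-bit result, zero-extended */
    return 2 * last + 2;
}

/* What "add %r8,%r8" computes when width is already a 64-bit value. */
static uint64_t span_from_long(int64_t width)
{
    return (uint64_t)(width + width);
}

/* For every positive width the two sequences give the same byte span. */
static void span_selfcheck(void)
{
    int w;

    for (w = 1; w <= 4096; w++)
        assert(span_from_int(w) == span_from_long(w));
    assert(span_from_int(INT32_MAX) == span_from_long(INT32_MAX));
}
```

So both forms end up with 2*width; the int form is just a longer way of getting there.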
rgb15ToUV 141 143 2
The long version uses rbx (it pushes and pops rbx) while the int version
doesn't. long accesses arrays with movzwl (%rdx,%rax,1),%r9d instead
of movzwl (%rdx),%ecx in the inner loop (I don't know what difference
this makes). long uses add %r8,%r8 instead of sub & lea.
Then there's:
rgb15ToUV_half 166 174 8
int
sub $0x1,%r8d
lea 0x4(%rdx,%r8,4),%r10
long
lea (%rdx,%r8,4),%r10
Very few functions are larger with long, such as:
rgb15ToY 95 84 -11
int uses sub & lea instead of add. long uses more 64-bit registers so
the instructions are larger.
And on to the caller,
swScale_C 10319 10082 -237
long has 9 more movslq, uses more stack
I haven't checked all functions though.
The final size (with runtime cpudetect):
841336 swscale_ints.o
841024 swscale_longs.o
The number of movslq between registers:
$ objdump -d swscale_ints.o | grep movslq | grep -v "(" | wc -l
1038
$ objdump -d swscale_longs.o | grep movslq | grep -v "(" | wc -l
927
No speed differences were ever noticed. Dark_Shikari tells me a movslq
between registers is 1 uop...
As for other architectures, the arm and ppc I have would have made no
difference since they're not 64-bit.
I've attached a patch which adds an array_index type, if that's what
you had in mind.
Otherwise I really don't know what to do. Long is being misused here,
and breaks compilation on mingw-w64.
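For anyone reading without the attachment, an index type of this kind could look roughly like the sketch below. This is an illustration only and not the contents of use_array_index.diff: the ptrdiff_t definition and the every_other_byte function are assumptions. The point is that on mingw-w64 (LLP64) long is 32 bits while pointers are 64 bits, so long is not a pointer-sized index there, whereas ptrdiff_t is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical definition; the attached patch may define it differently. */
typedef ptrdiff_t array_index;

/* Hypothetical converter in the style of the xxxToY functions: copies
 * every second byte of src into dst, indexing with array_index. */
static void every_other_byte(uint8_t *dst, const uint8_t *src,
                             array_index width)
{
    array_index i;

    for (i = 0; i < width; i++)
        dst[i] = src[2 * i];
}

static void array_index_selfcheck(void)
{
    static const uint8_t src[6] = { 9, 0, 8, 0, 7, 0 };
    uint8_t dst[3];

    /* Holds on the flat-memory targets discussed here (LP64 and LLP64);
     * the C standard itself does not guarantee it. */
    assert(sizeof(array_index) == sizeof(void *));

    every_other_byte(dst, src, 3);
    assert(dst[0] == 9 && dst[1] == 8 && dst[2] == 7);
}
```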
-------------- next part --------------
A non-text attachment was scrubbed...
Name: use_array_index.diff
Type: application/octet-stream
Size: 18604 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20100202/bb50c1cd/attachment.obj>