[Ffmpeg-devel] Re: [PATCH] fix snow x86 SIMD

John Dalgliesh johnd
Thu Aug 10 16:52:10 CEST 2006

On Thu, 10 Aug 2006, Michael Niedermayer wrote:
> On Thu, Aug 10, 2006 at 02:42:11PM +0200, Martin von Gagern wrote:
>> The code currently fails regression tests when configured with
>>   --extra-cflags="-DPIC -fPIC -fomit-frame-pointer"
>>   --disable-static --enable-shared
>> The program dies with SIGSEGV at the position corresponding to line 776:
>>              "paddd (%%"REG_D"), %%xmm0      \n\t"
>> After some debugging I found out that REG_D % 16 == 8 at this location,
>> whereas IA32 Handbook requires it to be 16 byte aligned, causing a
>> general protection exception otherwise, which explains an SIGSEGV imho.
>> Would you go through your code and see if it correctly ensures 16 byte
>> alignment for the address assembled in REG_D? Or maybe add a few
>> assertions so I can test where things are no longer aligned that should
>> have been? I would very much appreciate that.
> if the code where not guranteeing alignment it likely would have failed
> on someone elses system ...

I had a similar failure with OS X on a different paddd instruction in the 
ff_snow_horizontal_compose97i_sse2 function (line 159). ffmpeg crashed 
with an illegal instruction signal.

> * check that av_malloc() provides aligned memory
> * check that gcc assembled the code correctly, ive seen cases where gcc
>  silently violated some constrains
> * try valgrind, maybe it finds some out of array or uninitalized accesses
> * look at sb->line which is where edx comes from i think

In my case the problem IMHO is gcc. Despite the efforts of this line 28: 
DWTELEM * const temp = temp_buf + 4 - (((int)temp_buf & 0xF) >> 2);
.. temp does not end up aligned on a 16 byte boundary.

The first for loop in Lift 3 then effectively aligns temp but offsets the 
src and b pointers. So I think that loop is quite useless because if it 
ever executes it will break the asm code below (unless I am missing 

The reason that the code to align temp doesn't work is because gcc has 
optimised out the subexpression: '(((int)temp_buf & 0xF) >> 2)' giving it 
a value of 0. It appears that it believes that temp_buf will always be 
a multiple of 16. However when it executes it is not so.

I didn't post this earlier because I hadn't finished investigating it, to 
see if I can reproduce it in a simple program (so far no - gcc always 
aligns var-len arrays to 16 bytes), or to see if it's been fixed in a 
later gcc or acknowledged as a bug (I'm using 4.0.1).

I'm too tired to continue this tonight, but here is the workaround I am 
currently using:
DWTELEM * const temp = (DWTELEM*)(((intptr_t)&temp_buf[4])&~0xF);

Disclaimer: I don't know if my crash is related. I couldn't easily locate 
a var-len array used by the line quoted above. So just FYI then.


btw Not 20 hours I know... got back early enough for a quick email :)

More information about the ffmpeg-devel mailing list