[Ffmpeg-devel] alignment in H264 with altivec optimizations

Thu Apr 20 22:12:28 CEST 2006

Hi

I am making some Altivec optimizations to the ffmpeg H.264 decoder.

I submitted a patch some months ago, but it was rejected because it
did not produce correctly decoded video for some users.

I think the problem is related to some assumptions that I made
about the alignment of the pointers in some kernels:

1)
One case is the luma interpolation routines for 16x16 blocks:

static void PREFIX_h264_qpel16_h_lowpass_altivec(uint8_t *dst, uint8_t *src,
                                                 int dstStride, int srcStride);

The current AltiVec implementation assumes that the stride is a multiple
of 16 and that the "dst" pointer can have any misalignment. Because of
that, there is a fairly large amount of code for aligning the final
stores.
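
As a rough illustration of where that cost comes from, here is a portable
C sketch (not the actual AltiVec code) that models a 16-byte store which
may only touch memory in aligned 16-byte blocks. The names misalign16 and
store16_aligned_blocks are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte offset of p inside its 16-byte block. */
static unsigned misalign16(const void *p)
{
    return (unsigned)((uintptr_t)p & 15);
}

/*
 * Portable model of a 16-byte store restricted to aligned 16-byte
 * blocks. Returns the number of aligned blocks that were read and
 * written. The caller must own every aligned block that overlaps
 * dst[0..15].
 */
static int store16_aligned_blocks(uint8_t *dst, const uint8_t val[16])
{
    unsigned sh = misalign16(dst);
    uint8_t *blk = dst - sh;        /* aligned block containing dst[0] */
    size_t nblocks = sh ? 2 : 1;    /* a misaligned store spans two blocks */
    uint8_t tmp[32];

    assert(misalign16(blk) == 0);
    memcpy(tmp, blk, 16 * nblocks); /* "load" the aligned block(s) */
    memcpy(tmp + sh, val, 16);      /* merge the new bytes at offset sh */
    memcpy(blk, tmp, 16 * nblocks); /* "store" the aligned block(s) back */
    return (int)nblocks;
}
```

In this model an aligned dst needs one block and no merge, while any other
dst needs two loads, a merge and two stores, which is the extra work the
aligning code has to do.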

I have an implementation of these routines for 8x8 and 4x4 blocks, but if
dst is unaligned (with an arbitrary misalignment) the overhead of
aligning the stores dramatically reduces the speed-up.

Question-1: Can the dst pointer currently have any misalignment?
Is it possible to align the dst pointer?

2)
In the code for aligning the stores there is a section that loads and
aligns the destination:
    const vector unsigned char dst1 = vec_ld(0, dst);
    const vector unsigned char dst2 = vec_ld(16, dst);
    const vector unsigned char vdst = vec_perm(dst1, dst2, vec_lvsl(0, dst));

Question-2: If the stride is a multiple of 16, is it possible to take the
vec_lvsl(0, dst) out of the loop?
If so, I can submit a small patch that does that.
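
The reasoning behind this question can be checked with a small portable C
model. vec_lvsl(0, p) depends only on the low four bits of the address, so
if dst advances by a multiple of 16 per row the result should be the same
for every row. The sketch below is only a model under that assumption; the
function names lvsl_model and lvsl_is_row_invariant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable model of the permute vector produced by vec_lvsl(0, p):
 * bytes sh, sh+1, ..., sh+15, where sh = address & 15. */
static void lvsl_model(const uint8_t *p, uint8_t out[16])
{
    unsigned sh = (unsigned)((uintptr_t)p & 15);
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(sh + i);
}

/* Returns 1 if every row of an h-row block gives the same permute
 * vector as row 0, and 0 otherwise. */
static int lvsl_is_row_invariant(const uint8_t *dst, int stride, int h)
{
    uint8_t first[16], cur[16];

    assert(h > 0);
    lvsl_model(dst, first);
    for (int row = 1; row < h; row++) {
        lvsl_model(dst + (ptrdiff_t)row * stride, cur);
        if (memcmp(first, cur, 16) != 0)
            return 0;
    }
    return 1;
}
```

With a stride of 32 the check passes for any starting misalignment, while
a stride of 24 fails on the second row, so hoisting is only safe if the
multiple-of-16 stride is really guaranteed.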

3)
In the chroma interpolation routines, such as
void PREFIX_h264_chroma_mc8_altivec(uint8_t *dst, uint8_t *src,
                                    int stride, int h, int x, int y)
there is an assumption about the alignment of the dst pointer: the
destination can only be aligned or have a misalignment of 8.

Question-3: Does the chroma destination pointer have a different alignment
than the luma one?
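
For reference, the "aligned or 8" assumption corresponds to the arithmetic
sketched below. It assumes (and I have not confirmed that the decoder
guarantees this) that the chroma plane base is 16-byte aligned, that the
stride is a multiple of 16, and that 8-pixel-wide chroma blocks start at
columns that are multiples of 8. The helper chroma_dst_misalign is
hypothetical.

```c
#include <assert.h>

/* Misalignment (mod 16) of a chroma destination at byte column x_off
 * in row y, ASSUMING a 16-byte-aligned plane base and a stride that
 * is a multiple of 16. */
static unsigned chroma_dst_misalign(int x_off, int y, int stride)
{
    assert(stride % 16 == 0 && x_off >= 0 && y >= 0);
    return (unsigned)((x_off + y * stride) & 15);
}
```

Under those assumptions chroma_dst_misalign(mb_x * 8, y, stride) can only
be 0 or 8, whereas a block starting at a column that is a multiple of 4
could also give 4 or 12.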

And finally
4) IDCT
In the inverse transforms of H264:
void ff_h264_idct_add_c(uint8_t *dst, DCTELEM *block, int stride);
and
void ff_h264_idct8_add_c(uint8_t *dst, DCTELEM *block, int stride);

Question-4: What assumptions can I make about the alignment of the
"dst" pointer?

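One way to check such assumptions empirically is to count the dst
misalignments actually seen while decoding test streams. The sketch below
is a hypothetical debugging aid (AlignStats and its functions are not part
of ffmpeg) and can only show what was observed, not what is guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical debug counters: hist[k] counts the recorded pointers
 * whose address satisfies (address & 15) == k. */
typedef struct {
    unsigned long hist[16];
} AlignStats;

static void align_stats_record(AlignStats *s, const void *dst)
{
    assert(s != NULL);
    s->hist[(uintptr_t)dst & 15]++;
}

/* Largest power-of-two alignment (capped at 16) that every recorded
 * pointer satisfied. Returns 16 if nothing misaligned was recorded. */
static unsigned align_stats_guaranteed(const AlignStats *s)
{
    unsigned a = 16;

    for (unsigned k = 1; k < 16; k++)
        if (s->hist[k])
            while (k % a)   /* shrink a until it divides offset k */
                a >>= 1;
    return a;
}
```

Calling align_stats_record(&stats, dst) at the top of the C versions of
the IDCT functions and printing align_stats_guaranteed(&stats) at exit
would show the weakest alignment seen for the streams tested.
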
With this information I can correct my implementation of these kernels
and possibly speed up the H.264 decoder on the PPC architecture.

Thanks in advance for your comments.

Mauricio Alvarez