[Ffmpeg-devel] [PATCH] h264 - loopify some get_cabac calls

Mon Mar 26 22:50:40 CEST 2007

On Mar 25, 2007, at 4:09 PM, Guillaume Poirier wrote:

> Hi,
>
> On 3/25/07, Alexander Strange <astrange at ithinksw.com> wrote:
>>
>> On Mar 24, 2007, at 7:46 PM, Guillaume Poirier wrote:
>>
>>
>> >> There's some more AltiVec code here we'll probably send soon:
>> >> http://trac.perian.org/ticket/113
>> > I had a quick look at http://trac.perian.org/attachment/ticket/113/
>> > altivec_lum.3.diff
>> >
>> > Even though I imagine this patch isn't yet ready to be submitted,
>> > I'd like to ask if the in your opinion, transpose routines can make
>> > do without accessing memory (do it all in registers).
>>
>> They actually do, that patch is just messy enough to hide it.
>> The functions transpose4x4 and readVector aren't ever called.
>>
>> transpose4/6x16 only do memory operations because the initial loads
>> and stores are integrated into them.
>
> Ok, I hadn't looked carefully.
>

I think what Alexander was referring to here was what I wrote in the  
comments to the patch.  It is not that it doesn't do any memory  
access, it is that it doesn't store the transposed data in temporary  
memory.  For an example of what I am talking about here, look at  
h264_h_loop_filter_luma_mmx2 in libavcodec/i386/h264dsp_mmx.c.   
There, it does the transpose from memory to memory, runs  
h264_loop_filter_luma_mmx2 memory to memory, then does another set of  
transposes from memory to memory.  I changed the above mentioned  
patch to do the transpose from memory to registers, run the  
h264_loop_filter_luma_altivec on the registers, and then do the  
remaining transpose from registers to memory.

I renamed the functions to make this a bit more clear.  Also,  
transpose4x4 and readVector were used in previous versions of the  
patch and I just forgot to remove them from the later versions.  See  
(http://trac.perian.org/attachment/ticket/113/altivec_lum.4.diff).

BTW, earlier, it did transposes from memory to memory.  Eliminating  
temporary memory in the middle did create a noticeable speed boast.

>
>> I think the stuff in transpose6x16 can be cleaned up; it should be
>> able to use vec_ste instead of copying the result array.
>>
>> But this is my first time studying it too; I didn't write it.
>
> Ok. Who may that be? I attached a patch that uses these altivec
> routines on x264. I looks like they don't produce bit-identical
> results as the C version, but maybe it's just because I haven't
> modifed what needed to to make them work on x264 environment.
>

That would be me :)

I don't know about x264.  Is its C routine different than h264?  If  
not, then I suppose I should do some more testing to see if I can  
find what the differences are.  I did do several tests, and from what  
I could see, it was identical.  Maybe a test vector that the altivec  
version fails would help.

I wanted to use vec_ste, but it has alignment requirements.  The data  
in the v_loop_filter is aligned, but this is not true of data in the  
h_loop_filter.  Surely there is a better way to do the memory storing  
rather than the integer pointer fun at the end of write16x4.

>
>> > Also more cycles could be saved if you take advantage of some known
>> > alignments (8-bytes aligned load/store can be made faster than a
>> > generic unaligned memory access)....
>>
>> Hm, doesn't Altivec use the same unaligned load method for both?
>> (load x and 15+x, merge them)
>
> Well, if you know the alignement in advance, you don't need to compute
> the permute vector, that's all. I doesn't same all that much, just a
> tiny bit.
>

Yeah, the read_unaligned routine handles unaligned reads.  The part  
that computes the mask is the only part which can be pre-computed.

> Guillaume
> -- 
> Rich, you're forgetting one thing here: *everybody* except you is
> stupid.
>    M?ns Rullg?rd

- Graham