[FFmpeg-devel] Indeo3 replacement, part 2

Thu Oct 8 19:59:54 CEST 2009

Reimar Döffinger wrote:
> [...]
> These do the same thing to hi and lo independently, and each of these is 32 bits.
> I think this would be better as
> static inline uint32_t requant(uint32_t a) {
> #if HAVE_BIGENDIAN
>     a &= 0xFF00FF00;
>     a |= a >> 8;
> #else
>     a &= 0x00FF00FF;
>     a |= a << 8;
> #endif
>     return a;
> }
>
> That is of course unless it makes more sense to get rid of the lo/hi
> split very early on and just use uint64_t - that depends on how much the
> compiler would mess that up on 32 bit architectures I think; from what I
> can tell it should work a lot better on 64 bit architectures.
>
>   

OK, I've tested the possibility of using one uint64_t variable instead
of the hi/lo split. The really big trouble is not the mess the compiler
makes on 32bit machines but the endianness issue introduced by Intel's
design. Please look at the following scheme:

Indeo3 dyad correction (add 2 x 32bit delta):

1st DWORD (lo)    2nd DWORD (hi)
______________    ______________
B1, B2, B3, B4    B5, B6, B7, B8
--------------    --------------
    byte1             byte0

B1-B8 are pixels in memory, grouped into 2 x 32bit DWORDs.
"byte0" and "byte1" are bytes in the bitstream telling which delta
should be applied.

As you can see, the layout above is ALWAYS in little-endian order
because Intel = LE!

On a big-endian machine we need to do this processing in reverse;
otherwise it won't work right...
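
To make the byte-order point concrete, here is a minimal sketch (not
taken from the patch; the helper names load32_native and
first_byte_shift are made up for illustration). It loads four pixels
from memory as one native 32bit word and reports where the first pixel
ends up inside that word:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 4 pixels from memory as one native 32-bit word. */
static uint32_t load32_native(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));   /* avoids alignment and aliasing problems */
    return v;
}

/* Hypothetical helper: bit position of the first pixel in memory (B1)
 * inside the loaded word: 0 on little-endian, 24 on big-endian hosts. */
static unsigned first_byte_shift(void)
{
    const uint8_t probe[4] = { 1, 0, 0, 0 };
    return load32_native(probe) == 1 ? 0 : 24;
}
```

On a little-endian host B1 is the least significant byte of the word,
on a big-endian host it is the most significant one, so a packed delta
built for one byte order hits the wrong pixels on the other.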

If we use a uint64_t variable split into hi/lo parts, all we need is
to swap the order of the hi/lo parts of the delta table at the time of
its generation. The rest of the code remains unchanged because we
apply both delta parts separately in the right (little-endian) order,
like this:

pix_lo = ref_pix_lo + delta_lo[byte1];
pix_hi = ref_pix_hi + delta_lo[byte0];
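
As a self-contained sketch of the split variant (the function name
apply_dyad_split and the table name delta_tab are hypothetical, and the
table is assumed to have been generated already in the host's byte
order), the two 32bit additions could look like this:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: apply one dyad correction to 8 pixels with two
 * independent 32-bit additions. delta_tab is assumed to hold packed
 * deltas already laid out for the host byte order. */
static void apply_dyad_split(uint8_t *dst, const uint8_t *ref,
                             const uint32_t *delta_tab,
                             uint8_t byte0, uint8_t byte1)
{
    uint32_t lo, hi;

    memcpy(&lo, ref,     4);    /* B1..B4, the 1st DWORD */
    memcpy(&hi, ref + 4, 4);    /* B5..B8, the 2nd DWORD */

    lo += delta_tab[byte1];     /* byte1 selects the delta for the 1st DWORD */
    hi += delta_tab[byte0];     /* byte0 selects the delta for the 2nd DWORD */

    memcpy(dst,     &lo, 4);
    memcpy(dst + 4, &hi, 4);
}
```

No endianness test appears in the function itself; the byte order only
matters when the table is built.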

If we use one monolithic uint64_t variable we need to add some
endianness compensation like this:

if (HAVE_BIGENDIAN)
    FFSWAP(..., byte0, byte1);
delta64 = ((uint64_t)delta_lo[byte0] << 32) + delta_lo[byte1];
pix64 = ref_pix64 + delta64;

As one can see it will be slower on big-endian architectures.
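
A self-contained sketch of the monolithic variant follows. The names
apply_dyad_mono, delta_tab and host_is_big_endian are hypothetical, and
the run-time byte-order probe only stands in for the compile-time
HAVE_BIGENDIAN check so that the example builds without FFmpeg's
config. The table entry is widened to 64 bits before the shift, since
shifting a 32bit value by 32 is undefined in C.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the compile-time HAVE_BIGENDIAN check (assumption made
 * only to keep this sketch self-contained). */
static int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 0;
}

/* Hypothetical sketch: apply one dyad correction to 8 pixels with a
 * single 64-bit addition. A carry out of the low 32 bits would leak
 * into the high half, which two separate 32-bit additions cannot do. */
static void apply_dyad_mono(uint8_t *dst, const uint8_t *ref,
                            const uint32_t *delta_tab,
                            uint8_t byte0, uint8_t byte1)
{
    uint64_t pix64, delta64;

    if (host_is_big_endian()) { /* the extra work on big-endian hosts */
        uint8_t tmp = byte0;
        byte0 = byte1;
        byte1 = tmp;
    }

    /* widen before shifting so the shift is done in 64 bits */
    delta64 = ((uint64_t)delta_tab[byte0] << 32) + delta_tab[byte1];

    memcpy(&pix64, ref, 8);     /* B1..B8 as one native 64-bit word */
    pix64 += delta64;
    memcpy(dst, &pix64, 8);
}
```

The swap is the only part that differs between byte orders, and it is
executed once per dyad.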

So I have the following dilemma:
monolithic uint64_t:
    advantages
        - the code is more compact and readable
        - better operation on 64bit architectures
    drawbacks
        - requires more instructions on 32bit architectures
        - requires extra code to handle endianness and therefore tends
          to be slower

split uint64_t:
    advantages
        - no extra endianness handling
        - better operation on 32bit architectures
    drawbacks
        - separate code doing the same thing for both hi/lo parts is
          required
        - both the C code and the resulting machine code are bigger
Which design is preferable? I cannot decide...

Regards
Maxim