[FFmpeg-devel] [PATCH] RoQ video encoder (take 4)

Michael Niedermayer michaelni
Thu Jun 14 03:39:18 CEST 2007


Hi

On Wed, Jun 13, 2007 at 12:09:41PM +0200, Vitor wrote:
> Hi
> 
> Michael Niedermayer wrote:
> >Hi
> >  
> 
> I've followed your suggestion in the message 
> http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2007-June/030406.html
> and it did decrease the complexity of the patch.
> 
> I think that this encoder is still a bit verbose compared to the average 
> ffmpeg encoder (it goes through all the blocks to do motion detection, 
> then go through all of it again to find best codebook, etc), but fixing 
> this will not give a noticeable speed gain nor make the code simpler 
> (just smaller) so I don't think it is worth the work.

making it smaller is also a big advantage, less source to read, less memory
less code cache space needed ...


[...]
> >>+static void enlarge_roq_mb4(uint8_t base[3][16], uint8_t u[3][64])
> >>+{
> >>+    int x,y,cp;
> >>+
> >>+    for(y=0;y<8;y++)
> >>+        for(x=0;x<8;x++)
> >>+            for(cp=0;cp<3;cp++)
> >>+                u[cp][y*8+x] = base[cp][(y/2)*4 + (x/2)];
> >>    
> >
> >the for(cp loop should be the outermost IMHO
> >also
> >
> >enlarge_roq_mb4(uint8_t base[3][4][4], uint8_t u[3][8][8])
> >
> >u[cp][y][x] = base[cp][y>>1][x>>1];
> >seems more readable
> >
> >  
> 
> I don't know if I understand your suggestion, but if I do, I really 
> don't like passing a [3][16] vector to a function that expects a 
> [3][4][4]. I agree it is more readable, but every time I look at it I 
> think I spotted an error...

you understood my suggestion correctly, anyway this is not important, we
can leave this code as is if you prefer


> 
> >>+
> >>+/**
> >>+ * Template code to find the codebook with the lowest distortion from an 
> >>image
> >>+ */
> >>+#define GET_LOWEST_CB_ERR(FUNCT, DIM, ERR_COMMAND) \
> >>+static int FUNCT(uint8_t cluster[3][DIM], uint8_t cb[][3][DIM], int 
> >>numCB, int *outIndex) \
> >>+{ \
> >>+    int diff; \
> >>+    int pick=0, lDiff; \
> >>+    int i=0; \
> >>+\
> >>+    lDiff = INT_MAX; \
> >>+\
> >>+    /* Diff against the others */ \
> >>+    for (i=0; i<numCB; i++) { \
> >>+        diff = ERR_COMMAND \
> >>+        if (diff < lDiff) { \
> >>+            lDiff = diff; \
> >>+            pick = i; \
> >>+        } \
> >>+    } \
> >>+\
> >>+    *outIndex = pick; \
> >>+    return lDiff; \
> >>+}
> >>+
> >>+
> >>+GET_LOWEST_CB_ERR(index_mb2,  4, squared_diff_mb2(cluster, cb[i]);)
> >>+GET_LOWEST_CB_ERR(index_mb4, 16, squared_diff_mb4(cluster, cb[i]);)
> >>+GET_LOWEST_CB_ERR(index_mb8, 64, squared_diff_mb8(cluster, cb[i]);)
> >>    
> >
> >same question here, why is this not a normal funcion?
> >
> >  
> 
> I can't see a good way to avoid passing the parameter cb[][3*DIM] 
> without making cb a one-dimensional vector (which I think is less readable).

i dont really agree, a 1D array looks perefectly fine to me
this macro OTOH though does not look readable at all ...


[...]
> +#define CHROMA_BIAS 2

why 2?
does it look better? or is it just because it gives chroma in 4:4:4 the same
weight as luma?


[...]
> +static void unpack_roq_qcell(uint8_t cb2[][4*3], roq_qcell *qcell, uint8_t u[4*4*3])
> +{
> +    uint8_t base[4][4*3];
> +    int n,cp,i;
> +    static int offsets[4] = {0, 2, 8, 10};
> +
> +    for(i=0;i<4;i++)
> +        for(cp=0; cp<3; cp++)
> +            for(n=0; n<4; n++)
> +                base[i][cp+3*n]=cb2[qcell->idx[i]][cp+3*n];

for(i=0;i<4;i++)
    memcpy(base[i], cb2[qcell->idx[i]], 3*4);


[...]
> +// FIXME Could use DSPContext.sse, but it is not so speed critical (used
> +// just for motion estimation).
> +static int block_sse(uint8_t **buf1, uint8_t **buf2, int x1, int y1, int x2, int y2,
> +                     int stride, int size)
> +{
> +    int i, j, k, bias;
> +    int sse=0;
> +
> +    for (k=0; k<3; k++) {
> +      bias = (k ? CHROMA_BIAS : 1);
> +        for (i=0; i<size; i++)

indention looks wrong


[...]
> +static int eval_motion_dist(RoqContext *enc, int x, int y, int mx, int my,
> +                             int size)
> +{
> +    if (mx < -7 || mx > 7)
> +        return INT_MAX;
> +
> +    if (my < -7 || my > 7)
> +        return INT_MAX;
> +
> +    if ((x + mx - size < 0) || (x + mx > enc->width - size))
> +        return INT_MAX;
> +
> +    if ((y + my - size < 0) || (y + my > enc->height - size))
> +        return INT_MAX;

is (y + my - size < 0) correct? shouldnt it be y + my < 0


> +
> +
> +    return block_sse(enc->frame_to_enc->data, enc->last_frame->data, x, y,
> +                     x+mx, y+my, enc->y_stride, size);
> +}

mx+=x;
my+=y;

if ((unsigned)mx > enc->width - size || (unsigned)my > enc->height - size)
    return INT_MAX;

return block_sse(enc->frame_to_enc->data, enc->last_frame->data, x, y,
                 mx, my, enc->y_stride, size);

also block_sse could be split like:


              block_sse(dst[0], src[0], x, y, mx, my, enc->y_stride, size)
+ CHROMA_BIAS*block_sse(dst[1], src[1], x, y, mx, my, enc->c_stride, size)
+ CHROMA_BIAS*block_sse(dst[2], src[2], x, y, mx, my, enc->c_stride, size);


[...]
> +/**
> + * Initializes cel evaluators and sets their source coordinates
> + */
> +static int create_cel_evals(RoqContext *enc, roq_tempdata_t *tempData)
> +{
> +    int n,x,y;
> +
> +    tempData->cel_evals = av_malloc(enc->width*enc->height/64 * sizeof(cel_evaluation_t));
> +    if (!tempData->cel_evals)
> +        return -1;
> +
> +    memset(tempData->codebooks.usedCB2, 0, 256*sizeof(int));
> +    memset(tempData->codebooks.usedCB4, 0, 256*sizeof(int));
> +
> +    tempData->mainChunkSize = 0;
> +    tempData->used_option[0]=
> +      tempData->used_option[1]=
> +      tempData->used_option[2]=
> +      tempData->used_option[3]= 0;

the whole tempData is memset(0) before this function so the above zeroing is
not needed


[...]
> +            tempData->cel_evals[n].sourceX = x;
> +            tempData->cel_evals[n++].sourceY = y;
> +
> +            tempData->cel_evals[n].sourceX = x+8;
> +            tempData->cel_evals[n++].sourceY = y;
> +
> +            tempData->cel_evals[n].sourceX = x;
> +            tempData->cel_evals[n++].sourceY = y+8;
> +
> +            tempData->cel_evals[n].sourceX = x+8;
> +            tempData->cel_evals[n++].sourceY = y+8;

for(i=0; i<4; i++){
    tempData->cel_evals[n  ].sourceX = x * (i&1)*8;
    tempData->cel_evals[n++].sourceY = y * (i&2)*4;
}


[...]
> +/**
> + * Get macroblocks from parts of the image
> + */
> +static void get_frame_mb(AVFrame *frame, int x, int y, uint8_t mb[], int dim)
> +{
> +    int i, j, cp;
> +    int stride = frame->linesize[0];
> +
> +    for (i=0; i<dim; i++)
> +        for (j=0; j<dim; j++)
> +            for (cp=0; cp<3; cp++)
> +                mb[i*3*dim+j*3+cp] = frame->data[cp][(y+i)*stride+x+j];
> +}

why is the planar data converted to packed?


[...]

> +#define EVAL_MOTION(MOTION) \
> +    do { \
> +        diff = eval_motion_dist(enc, j, i, (MOTION).d[0], \
> +                                 (MOTION).d[1], blocksize); \
> +            \
> +        if (diff < lowestdiff) { \
> +            lowestdiff = diff; \
> +            bestpick = MOTION; \
> +        } \
> +    } while(0)

this can be an inline function


> +
> +static void motion_search(RoqContext *enc, int blocksize)
> +{
> +    static motion_vect offsets[9] = {

static const


> +        {{ 0, 0}},
> +        {{ 0,-1}},
> +        {{ 0, 1}},
> +        {{-1, 0}},
> +        {{ 1, 0}},
> +        {{-1, 1}},
> +        {{ 1,-1}},
> +        {{-1,-1}},
> +        {{ 1, 1}},
> +    };

i think the { 0, 0} makes no sense as it has already been tested


[...]
> +/**
> + * Gets distortion for all options available to a subcel
> + */
> +static void gather_data_for_subcel(subcel_evaluation_t *subcel, int x,
> +                                   int y, RoqContext *enc, roq_tempdata_t *tempData)
> +{
> +    uint8_t mb4[4*4*3];
> +    uint8_t mb2[2*2*3];
> +    int cluster_index;
> +    int i, best_dist;
> +
> +    static int ssc_offsets[4][2] = {

static const


> +        {0,0},
> +        {2,0},
> +        {0,2},
> +        {2,2},
> +    };
> +
> +    static int bitsUsed[4] = {2, 10, 10, 34};

static const


[...]
> +    best_dist = INT_MAX;
> +    for (i=0; i<4; i++)
> +        if (subcel->eval_dist[i]+enc->lambda*bitsUsed[i] < best_dist) {
> +            subcel->best_coding = i;
> +            subcel->best_bit_use = bitsUsed[i];
> +            best_dist = subcel->eval_dist[i]+enc->lambda*bitsUsed[i];
> +        }
> +
> +}

is the empty line between the 2 } intended?


[...]

> +
> +/**
> + * Gathers SE data for all feasible possibilities
> + */
> +static int gather_cel_data(RoqContext *enc, roq_tempdata_t *tempData)
> +{
> +    int max;
> +    cel_evaluation_t *cel_evals;
> +
> +    max = enc->width*enc->height/64;
> +    cel_evals = tempData->cel_evals;
> +
> +    while (max--) {
> +        gather_data_for_cel(cel_evals, enc, tempData);
> +        cel_evals++;
> +    }

isnt

for(i=0; i<enc->width*enc->height/64; i++)
    gather_data_for_cel(cel_evals+i, enc, tempData);

more readable?


[...]

> +    idx = 0;
> +    for (i=0; i<256; i++) {
> +        if (tempData->codebooks.usedCB2[i] || tempData->numCB4 == 256) {
> +            tempData->i2f2[i] = idx;
> +            tempData->f2i2[idx] = i;
> +            idx++;
> +        }
> +    }
> +    tempData->numCB2 = idx;
> +
> +
> +    if(tempData->numCB4 == 256)
> +        tempData->numCB2 = 256;

this if() does seem unneeded
btw, what is the purpose of the numCB4 == 256 special case above this if()?

i also think that it would be better to drop more than just the unused cbs
but rather calculate the precisse rate distortion (bits*lambda+sse)
difference which would result from droping a specific cb and drop it if
the rate distortion decreases
though this is just an idea for a future patch ...


> +
> +}
> +
> +static void write_codebooks(RoqContext *enc, roq_tempdata_t *tempData)
> +{
> +    int i, j;
> +
> +    /* Create codebook chunk */
> +    if (tempData->numCB2) {
> +        bytestream_put_le16(&enc->out_buf, RoQ_QUAD_CODEBOOK);
> +
> +        bytestream_put_le32(&enc->out_buf, tempData->numCB2*6 +
> +                            tempData->numCB4*4);
> +
> +        bytestream_put_byte(&enc->out_buf, tempData->numCB4);
> +        bytestream_put_byte(&enc->out_buf, tempData->numCB2);
> +
> +        for (i=0; i<tempData->numCB2; i++) {
> +            for (j=0; j<4; j++)
> +                bytestream_put_byte(&enc->out_buf, enc->cb2x2[tempData->f2i2[i]].y[j]);
> +
> +            bytestream_put_byte(&enc->out_buf, enc->cb2x2[tempData->f2i2[i]].u);
> +            bytestream_put_byte(&enc->out_buf, enc->cb2x2[tempData->f2i2[i]].v);
> +        }
> +
> +        for (i=0; i<tempData->numCB4; i++)
> +            for (j=0; j<4; j++)
> +                bytestream_put_byte(&enc->out_buf, tempData->i2f2[enc->cb4x4[tempData->f2i4[i]].idx[j]]);
> +
> +    }
> +}

a uint8_t **outp= &enc->out_buf;
would make the code slightly more readable IMHO


> +
> +/* NOTE: Typecodes must be spooled AFTER arguments!! */
> +#define SPOOL_ARGUMENT(arg)        \
> +do {\
> +    argumentSpool[argumentSpoolLength++] = (uint8_t)(arg);\
> +} while(0)

why hide this behind a macro?
because the variable names are too long maybe? :)


> +
> +#define SPOOL_MOTION(dx, dy)    \
> +do {\
> +    uint8_t arg, ax, ay;\
> +    ax = 8 - (uint8_t)(dx);\
> +    ay = 8 - (uint8_t)(dy);\
> +    arg = (uint8_t)(((ax&15)<<4) | (ay&15));\
> +    SPOOL_ARGUMENT(arg);\
> +} while(0)

there is absolutely no reason why this is a macro and not a function



> +
> +
> +
> +#define SPOOL_TYPECODE(type)        \
> +do {\
> +    int a;\
> +    typeSpool |= ((type) & 3) << (14 - typeSpoolLength);\
> +    typeSpoolLength += 2;\
> +    if (typeSpoolLength == 16) {\
> +        bytestream_put_le16(&enc->out_buf, typeSpool); \
> +        for (a=0; a<argumentSpoolLength; a++)\
> +            bytestream_put_byte(&enc->out_buf, argumentSpool[a]); \
> +        typeSpoolLength = 0;\
> +        typeSpool = 0;\
> +        argumentSpoolLength = 0;\
> +    }\
> +} while(0)

instead of this complicated fifo spool thingy, why not

use
bytestream_put_byte(&out, arg); instead of
SPOOL_ARGUMENT(arg)

and

write_typecode(){
    spool |= (type & 3) << (14 - spool_length);
    spool_length += 2;
    if (spool_length == 16){
        AV_WL16(typep, spool);
        spool= 0;
        typep= out;
        out+=2;
        spool_length=0;
    }
}


[...]
> +    static int subcelOffsets[4][2] = {
> +        {0,0},
> +        {4,0},
> +        {0,4},
> +        {4,4},
> +    };
> +
> +    static int subsubcelOffsets[4][2] = {
> +        {0,0},
> +        {2,0},
> +        {0,2},
> +        {2,2},
> +    };

subcelOffsets[x][0] = 4*(x&1)
subcelOffsets[x][1] = 2*(x&2)

so these tables arent really usefull IMHO, they just make the code harder
to read



> +
> +    if (tempData->used_option[RoQ_ID_CCC]%2)
> +        tempData->mainChunkSize+=8; //FIXME
> +
> +    /* Write the video chunk header */
> +    bytestream_put_le16(&enc->out_buf, RoQ_QUAD_VQ);
> +    bytestream_put_le32(&enc->out_buf, tempData->mainChunkSize/8);
> +    bytestream_put_byte(&enc->out_buf, 0x0);
> +    bytestream_put_byte(&enc->out_buf, 0x0);
> +
> +    start_buf=enc->out_buf;
> +    for (i=0; i<numBlocks; i++) {
> +        eval = tempData->cel_evals + i;
> +
> +        x = eval->sourceX;
> +        y = eval->sourceY;
> +
> +        switch (eval->best_coding) {
> +        case RoQ_ID_MOT:
> +            SPOOL_TYPECODE(RoQ_ID_MOT);
> +            break;
> +
> +        case RoQ_ID_FCC:
> +            SPOOL_MOTION(eval->motionX, eval->motionY);
> +            SPOOL_TYPECODE(RoQ_ID_FCC);
> +            ff_apply_motion_8x8(enc, x, y, eval->motionX, eval->motionY);
> +            break;
> +
> +        case RoQ_ID_SLD:
> +            SPOOL_ARGUMENT(tempData->i2f4[eval->cbEntry]);
> +            SPOOL_TYPECODE(RoQ_ID_SLD);
> +
> +            qcell = enc->cb4x4 + eval->cbEntry;
> +            ff_apply_vector_4x4(enc, x  , y  , enc->cb2x2 + qcell->idx[0]);
> +            ff_apply_vector_4x4(enc, x+4, y  , enc->cb2x2 + qcell->idx[1]);
> +            ff_apply_vector_4x4(enc, x  , y+4, enc->cb2x2 + qcell->idx[2]);
> +            ff_apply_vector_4x4(enc, x+4, y+4, enc->cb2x2 + qcell->idx[3]);
> +            break;
> +
> +        case RoQ_ID_CCC:
> +            SPOOL_TYPECODE(RoQ_ID_CCC);
> +            for (j=0; j<4; j++) {
> +                subX = subcelOffsets[j][0] + x;
> +                subY = subcelOffsets[j][1] + y;
> +
> +                switch(eval->subCels[j].best_coding) {
> +                case RoQ_ID_MOT:
> +                    SPOOL_TYPECODE(RoQ_ID_MOT);
> +                    break;
> +
> +                case RoQ_ID_FCC:
> +                    SPOOL_MOTION(eval->subCels[j].motionX,
> +                                 eval->subCels[j].motionY);
> +
> +                    SPOOL_TYPECODE(RoQ_ID_FCC);
> +                    ff_apply_motion_4x4(enc, subX, subY,
> +                                        eval->subCels[j].motionX,
> +                                        eval->subCels[j].motionY);
> +
> +                    break;
> +
> +                case RoQ_ID_SLD:
> +                    SPOOL_ARGUMENT(tempData->i2f4[eval->subCels[j].cbEntry]);
> +                    SPOOL_TYPECODE(RoQ_ID_SLD);
> +
> +                    qcell = enc->cb4x4 + eval->subCels[j].cbEntry;
> +
> +                    ff_apply_vector_2x2(enc, subX  , subY  ,
> +                                        enc->cb2x2 + qcell->idx[0]);
> +                    ff_apply_vector_2x2(enc, subX+2, subY  ,
> +                                        enc->cb2x2 + qcell->idx[1]);
> +                    ff_apply_vector_2x2(enc, subX  , subY+2,
> +                                        enc->cb2x2 + qcell->idx[2]);
> +                    ff_apply_vector_2x2(enc, subX+2, subY+2,
> +                                        enc->cb2x2 + qcell->idx[3]);
> +                    break;
> +
> +                case RoQ_ID_CCC:
> +                    for (k=0; k<4; k++) {
> +                        SPOOL_ARGUMENT(tempData->i2f2[eval->subCels[j].subCels[k]]);
> +                        ff_apply_vector_2x2(enc,
> +                                            subX+subsubcelOffsets[k][0],
> +                                            subY+subsubcelOffsets[k][1],
> +                                            enc->cb2x2 + eval->subCels[j].subCels[k]);
> +                    }
> +                    SPOOL_TYPECODE(RoQ_ID_CCC);
> +                    break;
> +                }

SPOOL_TYPECODE(eval->subCels[j].best_coding);


[...]
> +/**
> + * Create a single YUV cell from a 2x2 section of the image
> + */
> +static inline void frame_block_to_cell(uint8_t *block, uint8_t **data,
> +                                       int top, int left, int *stride)
> +{
> +    int i, u=0, v=0;
> +    static int offsets[4][2] = {{0,0},
> +                                {0,1},
> +                                {1,0},
> +                                {1,1}};

this can be replaced by i&1 / i&2


[...]
> +    if (size == 4)
> +      closest_cb = av_malloc(6*c_size*inputCount*sizeof(int));
> +    else
> +      closest_cb = tempdata->closest_cb2;

indention


> +
> +    ff_init_elbg(points, 6*c_size, inputCount, codebook, 256, 1, closest_cb, &enc->randctx);
> +    ff_do_elbg(points, 6*c_size, inputCount, codebook, 256, 1, closest_cb, &enc->randctx);

the size of the codebook should be decided based on rate distortion
(bits per codebook is known and distortion is too during elbg)
this is for a seperate patch though if you want to try it


[...]
> +            idx[j] = enc->cb4x4[i].idx[j];

unused


[...]

> +typedef struct {
> +    int d[2];
> +} motion_vect;

why not drop this struct and use int [2] directly?


[...]

> +    double lambda;

floating point code is very very bad as it means that the output is not binary
identical between architectures or compilers so regression tests wouldnt be
possible

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

If you really think that XML is the answer, then you definitly missunderstood
the question -- Attila Kinali
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20070614/34788378/attachment.pgp>



More information about the ffmpeg-devel mailing list