[FFmpeg-devel] [PATCH 1/2] lavfi/transpose: support slice threading

Fri Aug 16 12:38:23 CEST 2013

On 8/16/13, Michael Niedermayer <michaelni at gmx.at> wrote:
> On Thu, Aug 15, 2013 at 11:07:55PM +0000, Paul B Mahol wrote:
>> On 8/15/13, Michael Niedermayer <michaelni at gmx.at> wrote:
>> > On Wed, Aug 14, 2013 at 09:39:32PM +0000, Paul B Mahol wrote:
>> >> Signed-off-by: Paul B Mahol <onemda at gmail.com>
>> >> ---
>> >>  libavfilter/vf_transpose.c | 72 ++++++++++++++++++++++++++++++----------------
>> >>  1 file changed, 47 insertions(+), 25 deletions(-)
>> >>
>> >> diff --git a/libavfilter/vf_transpose.c b/libavfilter/vf_transpose.c
>> >> index 3ee9c6d..82f68e5 100644
>> >> --- a/libavfilter/vf_transpose.c
>> >> +++ b/libavfilter/vf_transpose.c
>> >> @@ -133,31 +133,19 @@ static AVFrame *get_video_buffer(AVFilterLink *inlink, int w, int h)
>> >>          ff_default_get_video_buffer(inlink, w, h);
>> >>  }
>> >>
>> >> -static int filter_frame(AVFilterLink *inlink, AVFrame *in)
>> >> +typedef struct ThreadData {
>> >> +    AVFrame *in, *out;
>> >> +} ThreadData;
>> >> +
>> >> +static int filter_slice(AVFilterContext *ctx, void *arg, int jobnr,
>> >> +                        int nb_jobs)
>> >>  {
>> >> -    TransContext *trans = inlink->dst->priv;
>> >> -    AVFilterLink *outlink = inlink->dst->outputs[0];
>> >> -    AVFrame *out;
>> >> +    TransContext *trans = ctx->priv;
>> >> +    ThreadData *td = arg;
>> >> +    AVFrame *out = td->out;
>> >> +    AVFrame *in = td->in;
>> >>      int plane;
>> >>
>> >> -    if (trans->passthrough)
>> >> -        return ff_filter_frame(outlink, in);
>> >> -
>> >> -    out = ff_get_video_buffer(outlink, outlink->w, outlink->h);
>> >> -    if (!out) {
>> >> -        av_frame_free(&in);
>> >> -        return AVERROR(ENOMEM);
>> >> -    }
>> >> -
>> >> -    out->pts = in->pts;
>> >> -
>> >> -    if (in->sample_aspect_ratio.num == 0) {
>> >> -        out->sample_aspect_ratio = in->sample_aspect_ratio;
>> >> -    } else {
>> >> -        out->sample_aspect_ratio.num = in->sample_aspect_ratio.den;
>> >> -        out->sample_aspect_ratio.den = in->sample_aspect_ratio.num;
>> >> -    }
>> >> -
>> >>      for (plane = 0; out->data[plane]; plane++) {
>> >>          int hsub = plane == 1 || plane == 2 ? trans->hsub : 0;
>> >>          int vsub = plane == 1 || plane == 2 ? trans->vsub : 0;
>> >> @@ -165,12 +153,14 @@ static int filter_frame(AVFilterLink *inlink, AVFrame *in)
>> >>          int inh  = in->height  >> vsub;
>> >>          int outw = FF_CEIL_RSHIFT(out->width,  hsub);
>> >>          int outh = FF_CEIL_RSHIFT(out->height, vsub);
>> >> +        int start = (outh *  jobnr   ) / nb_jobs;
>> >> +        int end   = (outh * (jobnr+1)) / nb_jobs;
>> >
>> > squares should be faster than long thin rectangles
>> > (this should be also true for the single thread case)
>>
>> Sorry, this does not make any sense to me.
>> If you have an idea how to do it better, then either go and do it or say
>> exactly what should be done and why.
>
> Consider a 1024x1024 image. If you transpose it line by line, either
> the input or the output is accessed along one column.
> Each of these accessed bytes causes a whole cache line to be read
> (64 bytes, for example), so after processing 1024 pixels,
> 1024 + 64*1024 bytes would be in the cache. The L1 data cache of most
> CPUs is probably smaller than that, so you might end up with 50-100%
> L1 cache misses.
>
> Transposing a 32x32 or maybe 64x64 byte block, OTOH, should fit nicely
> in the L1 cache.

The original code has the same issue, so this is unrelated and should be
done in a separate commit.
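
For reference, the split in this patch gives each job a contiguous band
of output rows. A small sketch of that arithmetic in isolation;
slice_start and slice_end are hypothetical helper names that only mirror
the start/end expressions in the diff above, they are not functions in
vf_transpose.c:

```c
/* Hypothetical helpers: first output row handled by job jobnr,
 * and one past the last row, when outh rows are split over nb_jobs. */
static int slice_start(int outh, int jobnr, int nb_jobs)
{
    return (outh * jobnr) / nb_jobs;
}

static int slice_end(int outh, int jobnr, int nb_jobs)
{
    return (outh * (jobnr + 1)) / nb_jobs;
}
```

With outh = 1080 and nb_jobs = 4 the bands are [0,270), [270,540),
[540,810) and [810,1080): adjacent jobs meet exactly and the last band
ends at outh, so every output row is written once.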

By the way, how badly do cache misses hurt?
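
One way to find out would be to time a tiled loop against the line-wise
one. Below is a minimal sketch of the block-wise approach suggested in
the quoted text, assuming an 8-bit plane and a tile edge of 32.
transpose_tiled and TILE are hypothetical names chosen for illustration,
not existing vf_transpose.c code, and the sketch does a plain transpose
without any of the filter's flip directions:

```c
#include <stddef.h>
#include <stdint.h>

#define TILE 32 /* assumed tile edge; 32 or 64 as suggested in the thread */

/* Hypothetical helper: transpose a w x h 8-bit plane so that
 * dst(row x, column y) = src(row y, column x).
 * dst is h pixels wide and w pixels tall; linesizes are in bytes. */
static void transpose_tiled(uint8_t *dst, ptrdiff_t dst_linesize,
                            const uint8_t *src, ptrdiff_t src_linesize,
                            int w, int h)
{
    for (int y0 = 0; y0 < h; y0 += TILE) {
        /* clamp the tile to the plane at the bottom edge */
        int y1 = y0 + TILE < h ? y0 + TILE : h;
        for (int x0 = 0; x0 < w; x0 += TILE) {
            /* clamp the tile to the plane at the right edge */
            int x1 = x0 + TILE < w ? x0 + TILE : w;
            /* within one tile, both src and dst touch at most
             * TILE rows, so the working set stays small */
            for (int y = y0; y < y1; y++)
                for (int x = x0; x < x1; x++)
                    dst[x * dst_linesize + y] = src[y * src_linesize + x];
        }
    }
}
```

Timing this against the same loop with TILE set larger than the plane
(which degenerates to the line-wise order) on a 1024x1024 plane should
show how much the misses cost on a given CPU.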

>
>
> [...]
>
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> When you are offended at any man's fault, turn to yourself and study your
> own failings. Then you will forget your anger. -- Epictetus
>