[FFmpeg-devel] [PATCH 2/4] avcodec/mips: MSA (MIPS-SIMD-Arch) optimizations for VP9 lpf functions

Thu Jul 16 17:08:52 CEST 2015

Hi,

On Thu, Jul 9, 2015 at 9:15 AM, <shivraj.patil at imgtec.com> wrote:

> +    if (__msa_test_bz_v(flat)) {
> +        p1_d = __msa_copy_u_d((v2i64) p1_out, 0);
> +        p0_d = __msa_copy_u_d((v2i64) p0_out, 0);
> +        q0_d = __msa_copy_u_d((v2i64) q0_out, 0);
> +        q1_d = __msa_copy_u_d((v2i64) q1_out, 0);
> +        SD4(p1_d, p0_d, q0_d, q1_d, (src - 2 * pitch), pitch);
> +    } else {
>

Can you elaborate on what this does? Does it check that none of the pixels
in the vector of cols/rows has flat=1, and take a shortcut if that's true?
Or something else? (If my assumption is right, can you please add a
comment to that effect?)
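
To make the question concrete, here is a scalar sketch of the behaviour I am
assuming, i.e. that __msa_test_bz_v(flat) is true only when the whole flat
vector is zero. The function name and the byte-per-column layout are
hypothetical and only model the idea; this is not the MSA code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model: flat[i] is nonzero if column i qualifies
 * for the wider (8-tap) filter. Returns 1 only if no column does, in
 * which case the caller could store the 4-tap results and skip the rest. */
static int all_flat_zero(const uint8_t *flat, size_t n)
{
    uint8_t any = 0;

    for (size_t i = 0; i < n; i++)
        any |= flat[i];

    return any == 0;
}
```

If that is what the branch does, a one-line comment such as "no pixel needs
filter8, store filter4 output only" above the if would be enough.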

> +static void vp9_lpf_vertical_16_dual_msa(uint8_t *src, int32_t pitch,
> +                                         uint8_t *b_limit_ptr,
> +                                         uint8_t *limit_ptr,
> +                                         uint8_t *thresh_ptr)
> +{
> +    uint8_t early_exit = 0;
> +    uint8_t transposed_input[16 * 24] ALLOC_ALIGNED(ALIGNMENT);
> +    uint8_t *filter48 = &transposed_input[16 * 16];
> +
> +    vp9_transpose_16x16((src - 8), pitch, &transposed_input[0], 16);
> +
> +    early_exit = vp9_vt_lpf_t4_and_t8_16w((transposed_input + 16 * 8),
> +                                          &filter48[0], src, pitch,
> +                                          b_limit_ptr, limit_ptr, thresh_ptr);
> +
> +    if (0 == early_exit) {
> +        early_exit = vp9_vt_lpf_t16_16w((transposed_input + 16 * 8), src, pitch,
> +                                        &filter48[0]);
> +
> +        if (0 == early_exit) {
> +            vp9_transpose_16x16(transposed_input, 16, (src - 8), pitch);
> +        }
> +    }
> +}
>

Since no state is shared between t16 and t4/t8, it suggests you're
calculating some of the filters twice (since part of the condition of
whether to apply the t16 filter is whether to apply the t8 filter), is that
true? If so, do you think it's worth modifying this so the check on whether
to run t4 or t8 is not re-evaluated in t16?
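
A scalar sketch of what I mean, assuming the usual VP9 8-bit flatness test
(neighbouring taps within 1 of p0/q0). The function names and the pixel
layout are hypothetical; the point is only that the second stage takes the
first stage's decision as an input instead of recomputing it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Layout assumed for this sketch: px[0..7] = p7..p0, px[8..15] = q0..q7. */

/* Returns 1 if taps first..last on both sides are within 1 of p0/q0. */
static int flat_range(const uint8_t *px, int first, int last)
{
    for (int i = first; i <= last; i++) {
        if (abs(px[7 - i] - px[7]) > 1 || abs(px[8 + i] - px[8]) > 1)
            return 0;
    }
    return 1;
}

/* Stage 1 (t4/t8): evaluates the inner flatness decision once. */
static int lpf_stage_t4_t8(const uint8_t *px)
{
    return flat_range(px, 1, 3);
}

/* Stage 2 (t16): reuses the stage 1 decision, checks only the outer taps. */
static int lpf_stage_t16(const uint8_t *px, int flat)
{
    return flat && flat_range(px, 4, 7);
}
```

The caller would do flat = lpf_stage_t4_t8(px) and then pass flat on to
lpf_stage_t16(px, flat), so the inner check runs once per edge.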

> +void ff_loop_filter_v_84_16_msa(uint8_t *src, ptrdiff_t stride,
> +                                int32_t e, int32_t i, int32_t h)
> +{
> +    uint8_t e1, i1, h1;
> +    uint8_t e2, i2, h2;
> +
> +    e1 = e & 0xff;
> +    i1 = i & 0xff;
> +    h1 = h & 0xff;
> +
> +    e2 = e >> 8;
> +    i2 = i >> 8;
> +    h2 = h >> 8;
> +
> +    vp9_lpf_horizontal_8_msa(src, stride, &e1, &i1, &h1, 1);
> +    vp9_lpf_horizontal_4_msa(src + 8, stride, &e2, &i2, &h2, 1);
> +}

So I think you're missing the point of why this function exists. The SIMD
code for e.g. 88_16 suggests you're capable of doing 16 pixels at once in a
single iteration, right? The idea here is that you can use the fact that t4
is a strict subset of t8 to run them both in the same iteration, with simply
a mask at the end to ensure that "whether to run t8 or t4" for the t4 half
of the pixels is always 0. Look at the x86 SIMD code for details on how that
would work exactly.
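
Roughly, in scalar form, the masking step could look like the sketch below.
All names are hypothetical, one byte stands in for one vector lane, and I am
assuming the first 8 columns are the t8 half and the last 8 the t4 half, as
in the 84 wrapper quoted above. It is an illustration of the idea, not the
x86 implementation.

```c
#include <stdint.h>

#define COLS 16

/* Hypothetical lane masks: bit i set means column i may use the 8-tap
 * filter. For an "84" edge only the first 8 columns may. */
#define WIDE_MASK_84 0x00ffu
#define WIDE_MASK_88 0xffffu

/* Picks the t8 result where the column is flat and belongs to a t8 half,
 * and the t4 result everywhere else. */
static void blend_t8_t4(uint8_t *dst, const uint8_t *t4_out,
                        const uint8_t *t8_out, const uint8_t *flat,
                        unsigned wide_mask)
{
    for (int i = 0; i < COLS; i++) {
        /* Force the "use t8" decision to 0 for columns in the t4 half. */
        int wide = flat[i] && ((wide_mask >> i) & 1u);
        uint8_t use_t8 = wide ? 0xff : 0x00;

        dst[i] = (uint8_t) ((t8_out[i] & use_t8) |
                            (t4_out[i] & (uint8_t) ~use_t8));
    }
}
```

With that, 88, 84, 48 and 44 can all share one 16-wide code path and differ
only in the constant mask.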

Ronald