[FFmpeg-devel] Armada 370 problem causes ffmpeg segmentation fault

Tue Jan 8 23:47:13 EET 2019

On 1/8/19, Simon Nash <simon at cjnash.com> wrote:
> Hello,
> This is my first message to ffmpeg-devel, so please excuse me if
> anything here doesn't conform to the normal mailing list conventions.
>
> I have encountered a problem with ffmpeg (a segmentation fault) that
> occurs only when running ffmpeg on the Marvell Armada 370 processor.
>
> I used gdb to debug the cause of the problem and found that it was
> caused by a malfunction in the floating-point unit of the Armada 370.
> This is not specific to a single machine but appears to affect all
> Armada 370 chips.
>
> The problem occurs in the following section of code in
> libavfilter/af_afir.c (lines 441 to 447):
>
>     for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
>         float *time = (float *)s->in[1]->extended_data[!s->one2many * ch];
>
>         for (i = 0; i < s->nb_taps; i++)
>             power += time[i] * time[i];
>     }
>     s->gain = sqrtf(ch / power);
>
> The generated assembler code for the inner loop includes the following:
>
>     0x0018a8d4 <activate+1660>:	add.w	r0, r6, #100	; 0x64
>     0x0018a8d8 <activate+1664>:	movs	r3, #0
>     0x0018a8da <activate+1666>:	vldr	s15, [r0, #-100]	; 0xffffff9c
>     0x0018a8de <activate+1670>:	adds	r3, #8
>     0x0018a8e0 <activate+1672>:	vldr	s8, [r0, #-96]	; 0xffffffa0
>     0x0018a8e4 <activate+1676>:	cmp	r3, lr
>     0x0018a8e6 <activate+1678>:	vldr	s9, [r0, #-92]	; 0xffffffa4
>     0x0018a8ea <activate+1682>:	pld	[r0]
>     0x0018a8ee <activate+1686>:	vldr	s10, [r0, #-88]	; 0xffffffa8
>     0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
>     0x0018a8f6 <activate+1694>:	vldr	s11, [r0, #-84]	; 0xffffffac
>     0x0018a8fa <activate+1698>:	vldr	s13, [r0, #-80]	; 0xffffffb0
>     0x0018a8fe <activate+1702>:	vldr	s14, [r0, #-76]	; 0xffffffb4
>     0x0018a902 <activate+1706>:	vldr	s15, [r0, #-72]	; 0xffffffb8
>     0x0018a906 <activate+1710>:	add.w	r0, r0, #32
>
> When the 32-bit floating-point multiply instruction
>     0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
> at activate+1690 is executed, there is a segmentation fault. This
> should never happen because the vmla.f32 instruction uses register
> values only. This problem happens when the value of the operand in
> s15 is small enough to cause an underflow when multiplied by itself
> (7.04930082e-28 in this case). I have a console log showing the
> state of the registers before and after single-stepping through the
> vmla.f32 instruction and I would be pleased to make this available
> in any convenient form.
>
> In other cases (when ffmpeg was compiled with a lower optimization
> level for debugging purposes), the vmla.f32 instruction completes
> but corrupts the unrelated register r7, causing a segmentation fault
> on a subsequent instruction. It seems that vmla.f32 produces
> unpredictable results on the Armada 370 when underflow occurs.
>
> I have investigated possible workarounds. The problem does not occur
> if the 32-bit float value is converted to a 64-bit value and
> multiplied using a vmla.f64 instruction producing a 64-bit result.
> This is because there is no underflow from ther multiplication.
> The following patch (diff file from git) does this. Changing to
> 64-bit multiplication would (I believe) avoid the possibility of
> underflow for any 32-bit coefficient value, which would mean that
> the code works correctly on the Armada 370 as well as all other
> processors.
>
> index 31919f6..3e69107 100644
> --- a/libavfilter/af_afir.c
> +++ b/libavfilter/af_afir.c
> @@ -375,6 +375,7 @@ static int convert_coeffs(AVFilterContext *ctx)
>       int left, offset = 0, part_size, max_part_size;
>       int ret, i, ch, n;
>       float power = 0;
> +    double power_d = 0;
>
>       s->nb_taps = ff_inlink_queued_samples(ctx->inputs[1]);
>       if (s->nb_taps <= 0)
> @@ -441,9 +442,12 @@ static int convert_coeffs(AVFilterContext *ctx)
>           for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
>               float *time = (float *)s->in[1]->extended_data[!s->one2many *
> ch];
>
> -            for (i = 0; i < s->nb_taps; i++)
> -                power += time[i] * time[i];
> +            for (i = 0; i < s->nb_taps; i++) {
> +                double coeff_d = (double)time[i];
> +                power_d += coeff_d * coeff_d; // ensure no underflow
> +            }
>           }
> +        power = (float)power_d;
>           s->gain = sqrtf(ch / power);
>           break;
>       default:
>
> This produces the following assembler code, with vmla.f32 replaced with
> vmla.f64:
>
>     0x0018a8d4 <activate+1660>:  add.w   r0, r6, #100    ; 0x64
>     0x0018a8d8 <activate+1664>:  movs    r3, #0
>     0x0018a8da <activate+1666>:  vldr    s12, [r0, #-100]        ;
> 0xffffff9c
>     0x0018a8de <activate+1670>:  adds    r3, #8
>     0x0018a8e0 <activate+1672>:  vldr    s0, [r0, #-96]  ; 0xffffffa0
>     0x0018a8e4 <activate+1676>:  cmp     r3, lr
>     0x0018a8e6 <activate+1678>:  vldr    s2, [r0, #-92]  ; 0xffffffa4
>     0x0018a8ea <activate+1682>:  pld     [r0]
>     0x0018a8ee <activate+1686>:  vldr    s4, [r0, #-88]  ; 0xffffffa8
>     0x0018a8f2 <activate+1690>:  vcvt.f64.f32    d6, s12
>     0x0018a8f6 <activate+1694>:  vcvt.f64.f32    d0, s0
>     0x0018a8fa <activate+1698>:  vcvt.f64.f32    d1, s2
>     0x0018a8fe <activate+1702>:  vcvt.f64.f32    d2, s4
>     0x0018a902 <activate+1706>:  vmla.f64        d7, d6, d6
>     0x0018a906 <activate+1710>:  vldr    s6, [r0, #-84]  ; 0xffffffac
>     0x0018a90a <activate+1714>:  vldr    s8, [r0, #-80]  ; 0xffffffb0
>     0x0018a90e <activate+1718>:  vldr    s10, [r0, #-76] ; 0xffffffb4
>     0x0018a912 <activate+1722>:  vldr    s12, [r0, #-72] ; 0xffffffb8
>     0x0018a916 <activate+1726>:  add.w   r0, r0, #32
>
> I have tested this and it works on the Armada 370 as well as on
> other ARMv7 platforms.
>
> The change to using 64-bit multiplication is unlikely to be a
> performance issue because this code runs only during activation.
> However, if there are concerns about using 64-bit multiplication
> on processors that don't have the bug, I can modify the patch to
> check whether the processor is the Armada 370 and use the current
> 32-bit multiplication code when running on other processors.
>
> I can provide a test case if required. This will reproduce the bug
> reliably if ffmpeg is running on an Armada 370 machine.
>
> I realize that this is not a bug in FFmpeg but I would like to ask
> the FFmpeg developers to apply this patch (or equivalent) so that
> ffmpeg is able to run correctly on the Armada 370. My application
> (MinimServer) is using ffmpeg for various audio transformations
> including applying FIR filters and has users with devices that use
> the Armada 370 as well as users on many other platforms.
>
> Please advise me what I should do now to help make this change
> happen. Many thanks.
>
> Best regards,
> Simon

This could use doubles, but perhaps same issue could happen there too
in another scenario.
Others may be against it. So better wait for others opinions.