[FFmpeg-devel] Armada 370 problem causes ffmpeg segmentation fault

Paul B Mahol onemda at gmail.com
Tue Jan 8 23:47:13 EET 2019


On 1/8/19, Simon Nash <simon at cjnash.com> wrote:
> Hello,
> This is my first message to ffmpeg-devel, so please excuse me if
> anything here doesn't conform to the normal mailing list conventions.
>
> I have encountered a problem with ffmpeg (a segmentation fault) that
> occurs only when running ffmpeg on the Marvell Armada 370 processor.
>
> I used gdb to debug the cause of the problem and found that it was
> caused by a malfunction in the floating-point unit of the Armada 370.
> This is not specific to a single machine but appears to affect all
> Armada 370 chips.
>
> The problem occurs in the following section of code in
> libavfilter/af_afir.c (lines 441 to 447):
>
>     for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
>         float *time = (float *)s->in[1]->extended_data[!s->one2many * ch];
>
>         for (i = 0; i < s->nb_taps; i++)
>             power += time[i] * time[i];
>     }
>     s->gain = sqrtf(ch / power);
>
> The generated assembler code for the inner loop includes the following:
>
>     0x0018a8d4 <activate+1660>:	add.w	r0, r6, #100	; 0x64
>     0x0018a8d8 <activate+1664>:	movs	r3, #0
>     0x0018a8da <activate+1666>:	vldr	s15, [r0, #-100]	; 0xffffff9c
>     0x0018a8de <activate+1670>:	adds	r3, #8
>     0x0018a8e0 <activate+1672>:	vldr	s8, [r0, #-96]	; 0xffffffa0
>     0x0018a8e4 <activate+1676>:	cmp	r3, lr
>     0x0018a8e6 <activate+1678>:	vldr	s9, [r0, #-92]	; 0xffffffa4
>     0x0018a8ea <activate+1682>:	pld	[r0]
>     0x0018a8ee <activate+1686>:	vldr	s10, [r0, #-88]	; 0xffffffa8
>     0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
>     0x0018a8f6 <activate+1694>:	vldr	s11, [r0, #-84]	; 0xffffffac
>     0x0018a8fa <activate+1698>:	vldr	s13, [r0, #-80]	; 0xffffffb0
>     0x0018a8fe <activate+1702>:	vldr	s14, [r0, #-76]	; 0xffffffb4
>     0x0018a902 <activate+1706>:	vldr	s15, [r0, #-72]	; 0xffffffb8
>     0x0018a906 <activate+1710>:	add.w	r0, r0, #32
>
> When the 32-bit floating-point multiply instruction
>     0x0018a8f2 <activate+1690>:	vmla.f32	s12, s15, s15
> at activate+1690 is executed, there is a segmentation fault. This
> should never happen because the vmla.f32 instruction uses register
> values only. This problem happens when the value of the operand in
> s15 is small enough to cause an underflow when multiplied by itself
> (7.04930082e-28 in this case). I have a console log showing the
> state of the registers before and after single-stepping through the
> vmla.f32 instruction and I would be pleased to make this available
> in any convenient form.
>
> In other cases (when ffmpeg was compiled with a lower optimization
> level for debugging purposes), the vmla.f32 instruction completes
> but corrupts the unrelated register r7, causing a segmentation fault
> on a subsequent instruction. It seems that vmla.f32 produces
> unpredictable results on the Armada 370 when underflow occurs.
>
> I have investigated possible workarounds. The problem does not occur
> if the 32-bit float value is converted to a 64-bit value and
> multiplied using a vmla.f64 instruction producing a 64-bit result.
> This is because there is no underflow from ther multiplication.
> The following patch (diff file from git) does this. Changing to
> 64-bit multiplication would (I believe) avoid the possibility of
> underflow for any 32-bit coefficient value, which would mean that
> the code works correctly on the Armada 370 as well as all other
> processors.
>
> index 31919f6..3e69107 100644
> --- a/libavfilter/af_afir.c
> +++ b/libavfilter/af_afir.c
> @@ -375,6 +375,7 @@ static int convert_coeffs(AVFilterContext *ctx)
>       int left, offset = 0, part_size, max_part_size;
>       int ret, i, ch, n;
>       float power = 0;
> +    double power_d = 0;
>
>       s->nb_taps = ff_inlink_queued_samples(ctx->inputs[1]);
>       if (s->nb_taps <= 0)
> @@ -441,9 +442,12 @@ static int convert_coeffs(AVFilterContext *ctx)
>           for (ch = 0; ch < ctx->inputs[1]->channels; ch++) {
>               float *time = (float *)s->in[1]->extended_data[!s->one2many *
> ch];
>
> -            for (i = 0; i < s->nb_taps; i++)
> -                power += time[i] * time[i];
> +            for (i = 0; i < s->nb_taps; i++) {
> +                double coeff_d = (double)time[i];
> +                power_d += coeff_d * coeff_d; // ensure no underflow
> +            }
>           }
> +        power = (float)power_d;
>           s->gain = sqrtf(ch / power);
>           break;
>       default:
>
> This produces the following assembler code, with vmla.f32 replaced with
> vmla.f64:
>
>     0x0018a8d4 <activate+1660>:  add.w   r0, r6, #100    ; 0x64
>     0x0018a8d8 <activate+1664>:  movs    r3, #0
>     0x0018a8da <activate+1666>:  vldr    s12, [r0, #-100]        ;
> 0xffffff9c
>     0x0018a8de <activate+1670>:  adds    r3, #8
>     0x0018a8e0 <activate+1672>:  vldr    s0, [r0, #-96]  ; 0xffffffa0
>     0x0018a8e4 <activate+1676>:  cmp     r3, lr
>     0x0018a8e6 <activate+1678>:  vldr    s2, [r0, #-92]  ; 0xffffffa4
>     0x0018a8ea <activate+1682>:  pld     [r0]
>     0x0018a8ee <activate+1686>:  vldr    s4, [r0, #-88]  ; 0xffffffa8
>     0x0018a8f2 <activate+1690>:  vcvt.f64.f32    d6, s12
>     0x0018a8f6 <activate+1694>:  vcvt.f64.f32    d0, s0
>     0x0018a8fa <activate+1698>:  vcvt.f64.f32    d1, s2
>     0x0018a8fe <activate+1702>:  vcvt.f64.f32    d2, s4
>     0x0018a902 <activate+1706>:  vmla.f64        d7, d6, d6
>     0x0018a906 <activate+1710>:  vldr    s6, [r0, #-84]  ; 0xffffffac
>     0x0018a90a <activate+1714>:  vldr    s8, [r0, #-80]  ; 0xffffffb0
>     0x0018a90e <activate+1718>:  vldr    s10, [r0, #-76] ; 0xffffffb4
>     0x0018a912 <activate+1722>:  vldr    s12, [r0, #-72] ; 0xffffffb8
>     0x0018a916 <activate+1726>:  add.w   r0, r0, #32
>
> I have tested this and it works on the Armada 370 as well as on
> other ARMv7 platforms.
>
> The change to using 64-bit multiplication is unlikely to be a
> performance issue because this code runs only during activation.
> However, if there are concerns about using 64-bit multiplication
> on processors that don't have the bug, I can modify the patch to
> check whether the processor is the Armada 370 and use the current
> 32-bit multiplication code when running on other processors.
>
> I can provide a test case if required. This will reproduce the bug
> reliably if ffmpeg is running on an Armada 370 machine.
>
> I realize that this is not a bug in FFmpeg but I would like to ask
> the FFmpeg developers to apply this patch (or equivalent) so that
> ffmpeg is able to run correctly on the Armada 370. My application
> (MinimServer) is using ffmpeg for various audio transformations
> including applying FIR filters and has users with devices that use
> the Armada 370 as well as users on many other platforms.
>
> Please advise me what I should do now to help make this change
> happen. Many thanks.
>
> Best regards,
> Simon

This could use doubles, but perhaps same issue could happen there too
in another scenario.
Others may be against it. So better wait for others opinions.


More information about the ffmpeg-devel mailing list