[FFmpeg-devel] [PATCH 1/6] opus: convert encoder and decoder to lavu/tx

Sun Sep 25 00:57:22 EEST 2022

Sep 24, 2022, 21:40 by martin at martin.st:

> On Sat, 24 Sep 2022, Hendrik Leppkes wrote:
>
>> On Sat, Sep 24, 2022 at 9:26 PM Hendrik Leppkes <h.leppkes at gmail.com> wrote:
>>
>>>
>>> On Sat, Sep 24, 2022 at 8:43 PM Martin Storsjö <martin at martin.st> wrote:
>>> >
>>> > On Sat, 24 Sep 2022, Lynne wrote:
>>> >
>>> > > This commit changes both the encoder and decoder to use the new lavu/tx code,
>>> > > which has faster C transforms and more assembly optimizations.
>>> >
>>> > What's the case of e.g. 32 bit arm - that does have a bunch of fft and
>>> > mdct assembly, but is that something that ends up used by opus today, or
>>> > does the mdct15 stuff use separate codepaths that aren't optimized there
>>> > today yet?
>>> >
>>>
>>> mdct15 only has some x86 assembly, nothing for ARM.
>>> Only the normal (power of 2) fft/mdct has some ARM 32-bit assembly.
>>>
>>
>> Actually, I missed that the mdct15 internally uses one of the normal
>> fft functions for a part of the calculation, but how much impact that
>> has on performance vs. the new code where the C alone is quite a bit
>> faster would have to be confirmed by Lynne.
>>
>
> Ok, fair enough.
>

I did some benchmarking. lavc's C nptwo MDCT alone is 10% slower than lavu's
C nptwo MDCT. I don't have 32-bit ARM hardware to test on, but I do have an
aarch64 A53 core. On it, with all optimizations enabled, the decoder became 15%
faster with this patch than without it. With lavu/tx's aarch64 assembly disabled
to simulate arm32's situation, the decoder was still 10% faster overall. It's
probably going to be similar on arm32.

On x86, the decoder with this patch but with all lavu/tx asm disabled was only
10% slower than the decoder without this patch. With this patch and assembly
enabled, the decoder is 15% faster overall on an Alder Lake system.

As for where the overall decoding time goes for Opus, the MDCT is very far behind
the largest overhead, coefficient decoding (on x86 with optimizations, 50% of the
time is spent there, whilst only 5% is spent on the MDCT in total). It's a very
optimized decoder.

In general, for the transform alone (a C non-power-of-two lavu MDCT at the lengths
used by Opus), using C instead of AVX for the ptwo part makes 960pt transforms on
the order of 20% slower, and using C instead of SSE for 240pt also costs around
20%. Most of this is due to function call overhead: there are
(framesize/2)/ptwo = 120, 60, 30 and 15 calls to ptwo FFTs per transform. The
assembly function largely eliminates this overhead by linking assembly functions
together with a minimal 'ABI'.

> What about ac3dsp then - that one seems like it's fairly optimized for arm?
>

Haven't touched them, they're still being used. Unfortunately, for AC3,
the full MDCT optimizations in lavc do make a difference: with this patch on,
the overall decoder becomes 15% slower on aarch64 with lavu/tx's asm disabled
and 7% slower with lavu/tx's asm enabled. I do plan to write aarch64 NEON SIMD
code for the MDCT in a month or so, unless someone is faster, which should make
the decoder at least 10% faster with lavu/tx.

For Opus, the used ptwo lengths are (framesize/2)/15 = 32, 16, 8 and 4pt FFTs.
If you'd like to help out, I've documented the C factorizations used in
doc/transforms.md. You could also try porting the existing assembly. It should be
trivial if they don't use the upper half of the tables. lavc's and lavu's FFT tables
differ by size - lavu's are half the size of lavc's tables, because lavc's tables
contain the multiplication factors mirrored after the halfway point. That's used by
the RDFT, and by the x86 assembly. It's not worth replicating this: the
memory overhead is just too much, especially on bandwidth-starved cores.
If the arm32 assembly uses the upper part, it shouldn't be too hard to
make it read from both the start and end point of the exptab array in the
recombination function of ptwo transforms.

The MDCT asm can be ported in a straightforward way and would improve
both decoders significantly. If the ABI is simpler than x86's, you could even
make the asm transform call into C functions, which would lessen the work.
A lot of the MDCT overhead is in the gather and multiplication part, whilst
the FFT is limited by mostly adds and memory bandwidth, so just with
MDCT assembly the decoder would get a lot faster.