[FFmpeg-devel] [PATCH] split-radix FFT

Thu Aug 7 20:38:01 CEST 2008

On Mon, Aug 04, 2008 at 06:04:57PM -0600, Loren Merritt wrote:
> On Fri, 25 Jul 2008, Loren Merritt wrote:
>
>> $subject, vaguely based on djbfft.
>> Changed from djb:
>> * added simd.
>> * removed the hand-scheduled pentium-pro code. gcc's output from simple C 
>> is better on all cpus I have access to.
>> * removed the distinction between fft and ifft. they're just permutations 
>> of eachother, so the difference belongs in revtab[] and not in the code.
>> * removed the distinction between pass() and pass_big(). C can always use 
>> the memory-efficient version, and simd never does because the shuffles are 
>> too costly.
>> * made an entirely different pass_big(), to avoid store->load aliasing.
>
> yasm version.
>
> Not nasm comptabile. In particular, I depend on the assembler to optimize 
> away reg*0 in an address, which nasm does in some cases but not if there 
> were 3 registers before the zero culling. I could workaround this at a cost 
> of about 10 lines of code.
>

> Doesn't distinguish HAVE_YASM from HAVE_MMX.

that is a problem but i think you fixed that alraedy in the partial update?

[...]

[build system stuff i do not maintain]

> +static int split_radix_permutation(int i, int n, int inverse)
> +{
> +    int m;
> +    if(n <= 2) return i;
> +    m = n >> 1;

> +    if(i < m) return split_radix_permutation(i,m,inverse) << 1;
> +    i -= m;
> +    m >>= 1;
> +    if(!inverse) i ^= m;
> +    if(i < m) return (split_radix_permutation(i,m,inverse) << 2) + 1;
> +    i -= m;
> +    return ((split_radix_permutation(i,m,inverse) << 2) - 1) & (n - 1);

iam not sure if its worth it to simplify this, but i think if we dont attempt
to mask of the high bits inside the function then the following might work:

if(!(i & m))          return  split_radix_permutation(i, m, inverse)<<1;
m >>= 1;
if(inverse == !(i&m)) return (split_radix_permutation(i, m, inverse)<<2) + 1;
else                  return (split_radix_permutation(i, m, inverse)<<2) - 1;

[...]

>      /* compute constant table for HAVE_SSE version */
> -    if (shuffle) {
> +    if (split_radix) {
> +        for(j=4; j<=nbits; j++) {
> +            int m = 1<<j;
> +            double freq = 2*M_PI/m;
> +            FFTSample *tab = ff_cos_tabs[j-4];
> +            for(i=0; i<=m/4; i++)
> +                tab[i] = cos(i*freq);
> +            for(i=1; i<m/4; i++)
> +                tab[m/2-i] = tab[i];
> +        }
> +        for(i=0; i<n; i++)

> +            s->revtab[(n - split_radix_permutation(i, n, s->inverse)) % n] = i;

s->revtab[(-split_radix_permutation(i, n, s->inverse)) & (n-1)] = i;

> +        s->tmp_buf = av_malloc(n * sizeof(FFTComplex));
> +    } else {
>          int np, nblocks, np2, l;
>          FFTComplex *q;
>  

> +        for(i=0; i<(n/2); i++) {
> +            alpha = 2 * M_PI * (float)i / (float)n;

the float casts seem uneeded, but as this is just moved around thats a
seperate issue.

[...]
> +/* z[0...8n-1], w[1...2n-1] */
> +#define PASS(name)\
> +static void name(FFTComplex *z, const FFTSample *wre, unsigned int n)\
> +{\
> +    FFTSample t1, t2, t3, t4, t5, t6;\
> +    int o1 = 2*n;\
> +    int o2 = 4*n;\
> +    int o3 = 6*n;\
> +    const FFTSample *wim = wre+o1;\
> +    n--;\
> +\
> +    TRANSFORM_ZERO(z[0],z[o1],z[o2],z[o3]);\
> +    TRANSFORM(z[1],z[o1+1],z[o2+1],z[o3+1],wre[1],wim[-1]);\
> +    do {\
> +        z += 2;\
> +        wre += 2;\
> +        wim -= 2;\
> +        TRANSFORM(z[0],z[o1],z[o2],z[o3],wre[0],wim[0]);\
> +        TRANSFORM(z[1],z[o1+1],z[o2+1],z[o3+1],wre[1],wim[-1]);\
> +    } while(--n);\
> +}
> +
> +PASS(pass)
> +#undef BUTTERFLIES
> +#define BUTTERFLIES BUTTERFLIES_BIG
> +PASS(pass_big)
> +
> +#define DECL_FFT(n,n2,n4)\
> +static void fft##n(FFTComplex *z)\
> +{\
> +    fft##n2(z);\
> +    fft##n4(z+n4*2);\
> +    fft##n4(z+n4*3);\
> +    pass(z,ff_cos_##n,n4/2);\
> +}
> +
> +DECL_FFT(32,16,8)
> +DECL_FFT(64,32,16)
> +DECL_FFT(128,64,32)
> +DECL_FFT(256,128,64)
> +DECL_FFT(512,256,128)
> +#define pass pass_big
> +DECL_FFT(1024,512,256)
> +DECL_FFT(2048,1024,512)
> +DECL_FFT(4096,2048,1024)
> +DECL_FFT(8192,4096,2048)
> +DECL_FFT(16384,8192,4096)
> +DECL_FFT(32768,16384,8192)
> +DECL_FFT(65536,32768,16384)
> +#undef pass
> +
> +static void (*fft_dispatch[])(FFTComplex*) = {
> +    fft4, fft8, fft16, fft32, fft64, fft128, fft256, fft512, fft1024,
> +    fft2048, fft4096, fft8192, fft16384, fft32768, fft65536,
> +};

It would be nice if the forced duplication could be limited to
#ifndef CONFIG_SMALL unless its significantly slower that way

[...]
> +    int n = 1<<s->nbits;
> +    int i;
> +    ff_fft_dispatch_3dn2(z, s->nbits);
>      asm volatile("femms");
> +    for(i=0; i<n; i+=2)
> +        FFSWAP(FFTSample, z[i].im, z[i+1].re);
>  }

could you elaborate on why this FFSWAP pass is needed?

[...]
> +align 8
> +dispatch_tab%3%2: pointer list_of_fft
> +
> +; on x86_32, this function does the register saving and restoring for all of fft
> +; the others pass args in registers and don't spill anything
> +cglobal ff_fft_dispatch%3%2, 2,5,0, z, nbits
> +    lea r2, [dispatch_tab%3%2 GLOBAL]
> +    mov r2, [r2 + (nbitsq-2)*gprsize]

position independant code right after a table that needs relocations ...
no complaint i just find it ironic 

and iam fine with the rest

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Old school: Use the lowest level language in which you can solve the problem
            conveniently.
New school: Use the highest level language in which the latest supercomputer
            can solve the problem without the user falling asleep waiting.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080807/969a51fa/attachment.pgp>