[FFmpeg-devel] [PATCH] split-radix FFT
Michael Niedermayer
michaelni
Thu Aug 7 20:38:01 CEST 2008
On Mon, Aug 04, 2008 at 06:04:57PM -0600, Loren Merritt wrote:
> On Fri, 25 Jul 2008, Loren Merritt wrote:
>
>> $subject, vaguely based on djbfft.
>> Changed from djb:
>> * added simd.
>> * removed the hand-scheduled pentium-pro code. gcc's output from simple C
>> is better on all cpus I have access to.
>> * removed the distinction between fft and ifft. they're just permutations
>> of eachother, so the difference belongs in revtab[] and not in the code.
>> * removed the distinction between pass() and pass_big(). C can always use
>> the memory-efficient version, and simd never does because the shuffles are
>> too costly.
>> * made an entirely different pass_big(), to avoid store->load aliasing.
>
> yasm version.
>
> Not nasm comptabile. In particular, I depend on the assembler to optimize
> away reg*0 in an address, which nasm does in some cases but not if there
> were 3 registers before the zero culling. I could workaround this at a cost
> of about 10 lines of code.
>
> Doesn't distinguish HAVE_YASM from HAVE_MMX.
that is a problem but i think you fixed that alraedy in the partial update?
[...]
[build system stuff i do not maintain]
> +static int split_radix_permutation(int i, int n, int inverse)
> +{
> + int m;
> + if(n <= 2) return i;
> + m = n >> 1;
> + if(i < m) return split_radix_permutation(i,m,inverse) << 1;
> + i -= m;
> + m >>= 1;
> + if(!inverse) i ^= m;
> + if(i < m) return (split_radix_permutation(i,m,inverse) << 2) + 1;
> + i -= m;
> + return ((split_radix_permutation(i,m,inverse) << 2) - 1) & (n - 1);
iam not sure if its worth it to simplify this, but i think if we dont attempt
to mask of the high bits inside the function then the following might work:
if(!(i & m)) return split_radix_permutation(i, m, inverse)<<1;
m >>= 1;
if(inverse == !(i&m)) return (split_radix_permutation(i, m, inverse)<<2) + 1;
else return (split_radix_permutation(i, m, inverse)<<2) - 1;
[...]
> /* compute constant table for HAVE_SSE version */
> - if (shuffle) {
> + if (split_radix) {
> + for(j=4; j<=nbits; j++) {
> + int m = 1<<j;
> + double freq = 2*M_PI/m;
> + FFTSample *tab = ff_cos_tabs[j-4];
> + for(i=0; i<=m/4; i++)
> + tab[i] = cos(i*freq);
> + for(i=1; i<m/4; i++)
> + tab[m/2-i] = tab[i];
> + }
> + for(i=0; i<n; i++)
> + s->revtab[(n - split_radix_permutation(i, n, s->inverse)) % n] = i;
s->revtab[(-split_radix_permutation(i, n, s->inverse)) & (n-1)] = i;
> + s->tmp_buf = av_malloc(n * sizeof(FFTComplex));
> + } else {
> int np, nblocks, np2, l;
> FFTComplex *q;
>
> + for(i=0; i<(n/2); i++) {
> + alpha = 2 * M_PI * (float)i / (float)n;
the float casts seem uneeded, but as this is just moved around thats a
seperate issue.
[...]
> +/* z[0...8n-1], w[1...2n-1] */
> +#define PASS(name)\
> +static void name(FFTComplex *z, const FFTSample *wre, unsigned int n)\
> +{\
> + FFTSample t1, t2, t3, t4, t5, t6;\
> + int o1 = 2*n;\
> + int o2 = 4*n;\
> + int o3 = 6*n;\
> + const FFTSample *wim = wre+o1;\
> + n--;\
> +\
> + TRANSFORM_ZERO(z[0],z[o1],z[o2],z[o3]);\
> + TRANSFORM(z[1],z[o1+1],z[o2+1],z[o3+1],wre[1],wim[-1]);\
> + do {\
> + z += 2;\
> + wre += 2;\
> + wim -= 2;\
> + TRANSFORM(z[0],z[o1],z[o2],z[o3],wre[0],wim[0]);\
> + TRANSFORM(z[1],z[o1+1],z[o2+1],z[o3+1],wre[1],wim[-1]);\
> + } while(--n);\
> +}
> +
> +PASS(pass)
> +#undef BUTTERFLIES
> +#define BUTTERFLIES BUTTERFLIES_BIG
> +PASS(pass_big)
> +
> +#define DECL_FFT(n,n2,n4)\
> +static void fft##n(FFTComplex *z)\
> +{\
> + fft##n2(z);\
> + fft##n4(z+n4*2);\
> + fft##n4(z+n4*3);\
> + pass(z,ff_cos_##n,n4/2);\
> +}
> +
> +DECL_FFT(32,16,8)
> +DECL_FFT(64,32,16)
> +DECL_FFT(128,64,32)
> +DECL_FFT(256,128,64)
> +DECL_FFT(512,256,128)
> +#define pass pass_big
> +DECL_FFT(1024,512,256)
> +DECL_FFT(2048,1024,512)
> +DECL_FFT(4096,2048,1024)
> +DECL_FFT(8192,4096,2048)
> +DECL_FFT(16384,8192,4096)
> +DECL_FFT(32768,16384,8192)
> +DECL_FFT(65536,32768,16384)
> +#undef pass
> +
> +static void (*fft_dispatch[])(FFTComplex*) = {
> + fft4, fft8, fft16, fft32, fft64, fft128, fft256, fft512, fft1024,
> + fft2048, fft4096, fft8192, fft16384, fft32768, fft65536,
> +};
It would be nice if the forced duplication could be limited to
#ifndef CONFIG_SMALL unless its significantly slower that way
[...]
> + int n = 1<<s->nbits;
> + int i;
> + ff_fft_dispatch_3dn2(z, s->nbits);
> asm volatile("femms");
> + for(i=0; i<n; i+=2)
> + FFSWAP(FFTSample, z[i].im, z[i+1].re);
> }
could you elaborate on why this FFSWAP pass is needed?
[...]
> +align 8
> +dispatch_tab%3%2: pointer list_of_fft
> +
> +; on x86_32, this function does the register saving and restoring for all of fft
> +; the others pass args in registers and don't spill anything
> +cglobal ff_fft_dispatch%3%2, 2,5,0, z, nbits
> + lea r2, [dispatch_tab%3%2 GLOBAL]
> + mov r2, [r2 + (nbitsq-2)*gprsize]
position independant code right after a table that needs relocations ...
no complaint i just find it ironic
and iam fine with the rest
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Old school: Use the lowest level language in which you can solve the problem
conveniently.
New school: Use the highest level language in which the latest supercomputer
can solve the problem without the user falling asleep waiting.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
URL: <http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/attachments/20080807/969a51fa/attachment.pgp>
More information about the ffmpeg-devel
mailing list