Sky Blue Trades

Haskell FFT 9: Prime-Length FFT With Rader's Algorithm

[The code from this article is available as a Gist]

In the last article in this series, benchmarking results gave a pretty good indication that we need to do something about prime-length FFTs if we want to get good scaling for all input sizes. Falling back on the O(N^2) DFT algorithm just isn't good enough.

In this article, we’re going to look at an implementation of Rader’s FFT algorithm for prime-length inputs. Some aspects of this algorithm are not completely straightforward to understand since they rely on a couple of results from number theory and group theory. We’ll try to use small examples to motivate what’s going on.

Before we begin exploring Rader's algorithm, it's probably worth taking a minute to look at why we might expect there to be a better algorithm for prime-length FFTs than the naïve O(N^2) DFT algorithm we've been using so far. Throughout this article, we're going to use F_5 and F_7 as running examples (you'll see why later). Here's F_5:

  F_5 = [ 1   1     1     1     1
          1   ω     ω^2   ω^3   ω^4
          1   ω^2   ω^4   ω     ω^3
          1   ω^3   ω     ω^4   ω^2
          1   ω^4   ω^3   ω^2   ω   ]      (ω = ω_5)
and here’s F7:

  F_7 = [ 1   1     1     1     1     1     1
          1   ω     ω^2   ω^3   ω^4   ω^5   ω^6
          1   ω^2   ω^4   ω^6   ω     ω^3   ω^5
          1   ω^3   ω^6   ω^2   ω^5   ω     ω^4
          1   ω^4   ω     ω^5   ω^2   ω^6   ω^3
          1   ω^5   ω^3   ω     ω^6   ω^4   ω^2
          1   ω^6   ω^5   ω^4   ω^3   ω^2   ω   ]      (ω = ω_7)
The main thing to notice is that if you look at the sub-matrix of F_5 or F_7 that excludes the top row and left column (both of which have entries that are all ones), you'll see a matrix each of whose rows and columns is a permutation of the values ω_N^1, ω_N^2, …, ω_N^{N−1}. This is because each row of F_N is of the form (ω_N^{0k}, ω_N^{1k}, ω_N^{2k}, …, ω_N^{(N−1)k}) for 1 ≤ k ≤ N−1. For prime N, every k in this range is coprime to N, so there are no "repeats" among the powers of ω_N within a row.

So, there’s definitely something a little bit special about FN for prime N. To make effective use of this to produce a prime-length FFT requires some non-obvious insight.

Throughout the remainder of this article, we’ll use p to denote some prime number, just to avoid repeatedly writing “for some prime N”.

Rader’s algorithm

The basic idea of Rader's algorithm is to make use of the special permutation structure of the powers of ω_p in F_p to write the DFT summation

  H_k = Σ_{n=0}^{p−1} h_n ω_p^{nk},   k = 0, 1, …, p−1     (1)
as a convolution, which can then be calculated via FFTs of non-prime lengths. We’ll step through how this works, leaving a couple of details aside until we come to the implementation.

Separation of zero-index values

The first thing we do is to split out the special zero-index parts of the Fourier matrix calculation, so that (1) becomes

  H_0 = Σ_{n=0}^{p−1} h_n

  H_k = h_0 + Σ_{n=1}^{p−1} h_n ω_p^{nk},   k = 1, 2, …, p−1     (2)
We thus need to work out an efficient way of calculating the sum in the second expression.

Multiplicative group modulo n

The most common way to put a group structure on the integers modulo some number n is to use addition as the group operation, giving the group ℤ/nℤ with group elements 0, 1, 2, …, n−1. A less common approach is to take the integers in 1, 2, …, n−1 that are coprime to n, with the group operation being multiplication modulo n. This group is denoted by a range of different notations, but we'll call it ℤ_n^×. For prime n, all of the numbers 1, 2, …, n−1 are elements of ℤ_n^×. For example, for n=5 and n=7, we have the following group multiplication tables:

  ×  1  2  3  4
  1  1  2  3  4
  2  2  4  1  3
  3  3  1  4  2
  4  4  3  2  1

  ×  1  2  3  4  5  6
  1  1  2  3  4  5  6
  2  2  4  6  1  3  5
  3  3  6  2  5  1  4
  4  4  1  5  2  6  3
  5  5  3  1  6  4  2
  6  6  5  4  3  2  1

This group turns out to be the key to writing the sum in (2) as a convolution, since both n and k range over the values 1, 2, …, p−1, and the multiplication in the power of ω_p is of necessity calculated modulo p (since ω_p^p = 1).
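The tables above can be generated mechanically. A tiny list-based sketch (the function name is our own):

```haskell
-- Multiplication table for Z_n^x; for prime n the elements are 1..n-1.
multTable :: Int -> [[Int]]
multTable n = [[a * b `mod` n | b <- [1 .. n - 1]] | a <- [1 .. n - 1]]
```

Each row of `multTable 5` reproduces a row of the n=5 table above, and every row is again a permutation of 1, …, n−1.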

Group generators

The group ℤ_p^× is cyclic, which means that any element a of the group can be written as an integer power of a single element of the group, the group generator g, i.e. a = g^i for some unique integer i with 0 ≤ i ≤ p−2, and equivalently a = g^{−j} for some unique integer j with 0 ≤ j ≤ p−2, where negative powers of g denote powers of the multiplicative inverse (modulo p) of g. For example, for ℤ_5^×, either 2 or 3 is a generator:

  2^0 = 1,  2^1 = 2,  2^2 = 4,  2^3 = 3   (mod 5)
  3^0 = 1,  3^1 = 3,  3^2 = 4,  3^3 = 2   (mod 5)
In this case, 2 = 3^{−1} and 3 = 2^{−1}. For ℤ_7^×, a suitable generator is 3, and

  3^0 = 1,  3^1 = 3,  3^2 = 2,  3^3 = 6,  3^4 = 4,  3^5 = 5   (mod 7)
In this case, 5 = 3^{−1}.

We’ll talk about how we determine the group generator g later. For the moment, assume that we know a suitable generator.

Representation as convolution

Given a generator g for the group ℤ_p^×, we can write the second equation in (2) as

  H_{g^{−q}} = h_0 + Σ_{r=0}^{p−2} h_{g^r} ω_p^{g^{−(q−r)}},   q = 0, 1, …, p−2     (3)
(This relies on the fact that h_k and H_k are both periodic in k with period p and that ω_p^p = 1.)

For example, if p=5, taking g=2, we have:

  H_1 − h_0 = h_1 ω^1 + h_2 ω^2 + h_4 ω^4 + h_3 ω^3   (q = 0)
  H_3 − h_0 = h_1 ω^3 + h_2 ω^1 + h_4 ω^2 + h_3 ω^4   (q = 1)
  H_4 − h_0 = h_1 ω^4 + h_2 ω^3 + h_4 ω^1 + h_3 ω^2   (q = 2)
  H_2 − h_0 = h_1 ω^2 + h_2 ω^4 + h_4 ω^3 + h_3 ω^1   (q = 3)

(writing ω for ω_5)
and comparison of the right hand sides of each of these equations with the rows of F_5 shows that this generator-based approach replicates the rows of the Fourier matrix, albeit in a different order from the "normal" one.

The summation in (3) is in the form of a cyclic convolution of the two sequences a_q = h_{g^q} and b_q = ω_p^{g^{−q}}, both of length p−1 with 0 ≤ q ≤ p−2. For example, for p=5, taking g=2:

  q     0      1      2      3
  a_q   h_1    h_2    h_4    h_3
  b_q   ω_5^1  ω_5^3  ω_5^4  ω_5^2

and for p=7 with g=3:

  q     0      1      2      3      4      5
  a_q   h_1    h_3    h_2    h_6    h_4    h_5
  b_q   ω_7^1  ω_7^5  ω_7^4  ω_7^6  ω_7^2  ω_7^3
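The index sequences in these tables come from powers of g (for a_q) and powers of g^{−1} (for the exponents in b_q). A small sketch computing both at once (hypothetical helper names; a real implementation would pre-compute these during planning):

```haskell
-- Indexes used in Rader's convolution for prime p with generator g:
-- the first list gives the subscripts of h (powers of g mod p), the
-- second the exponents of omega_p (powers of g^-1 mod p).
raderIndexes :: Int -> Int -> ([Int], [Int])
raderIndexes p g = (powers g, powers gInv)
  where
    powers b = take (p - 1) (iterate (\x -> x * b `mod` p) 1)
    -- brute-force multiplicative inverse of g modulo p
    gInv = head [x | x <- [1 .. p - 1], x * g `mod` p == 1]
```

`raderIndexes 5 2` reproduces the p=5 table above, and `raderIndexes 7 3` the p=7 one.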

Calculation of convolution using FFT

Recall that, for a continuous convolution of the form

  (f ∗ g)(t) = ∫_{−∞}^{∞} f(τ) g(t − τ) dτ
we have the convolution theorem:

  ℱ(f ∗ g) = ℱ(f) · ℱ(g)
where we denote the Fourier transform of a function f by ℱ(f). A comparable result applies for the discrete cyclic convolution in (3). Let us denote the sequence a_q defined above by h̃ = (h_{g^q}) and the sequence b_q by ω̃_p = (ω_p^{g^{−q}}), both for 0 ≤ q ≤ p−2, and let us write H̃ = (H_{g^{−q}}) for the same range of q to represent the results of the convolution as a sequence. Then:

  H̃_q = h_0 + (h̃ ∗ ω̃_p)_q = h_0 + DFT^{−1}[ DFT(h̃) · DFT(ω̃_p) ]_q
The discrete Fourier transform of the prime-length input can thus be calculated by:

  1. Reordering the input vector according to the indexing scheme for the sequence h̃;
  2. Calculating the DFT of h̃;
  3. Multiplying DFT[h̃] pointwise by the DFT of the reordered sequence of powers of ω_p represented by the sequence ω̃_p;
  4. Calculating the inverse DFT of this product;
  5. Reordering the result according to the indexing scheme for the sequence H̃ and adding in the DC offset h_0.

The convolution derived above is of length p−1 which, since p is prime, must be composite for p > 3. We can thus calculate the length-(p−1) discrete Fourier transforms required by the above approach using our Cooley-Tukey divide and conquer algorithm. The index reorderings and the DFT of the sequence of powers of ω_p can all be determined in advance, since they depend only on p.

For p=5, taking g=2, this approach looks like this:

  1. Reorder the input vector to give h̃ = (h_1, h_2, h_4, h_3).
  2. Calculate DFT[h̃] using a conventional FFT algorithm; in this case, this is efficient because the length of h̃ is a power of two.
  3. Multiply DFT[h̃] pointwise by DFT[ω̃_5], which can be pre-computed: ω̃_5 = (ω_5^1, ω_5^3, ω_5^4, ω_5^2).
  4. Calculate the inverse DFT of the pointwise product to give H̃; this is again an efficient calculation because the input length is a power of two.
  5. Add the DC offset and extract the final results from H̃ = (H_1, H_3, H_4, H_2).

Padding for efficiency

For p=5, p−1 is a power of two, which means that the FFT calculations needed in Rader's algorithm are as efficient as possible. However, if p−1 itself has large prime factors, it may prove necessary to apply Rader's algorithm recursively, which will be much less efficient than a direct recursive FFT.

An alternative (and better) approach is provided by zero-padding the sequences involved in the convolution: by padding to a length that is a power of two, recursive application of Rader's algorithm is avoided. To do this, let us write the two sequences to be convolved as f[i] and g[i], both of length M−1 (we write M−1 here to match the convolution length p−1 in the FFT calculation, and we use the square bracket notation for indexing to make some of the expressions below less ambiguous). The cyclic convolution of these two sequences is

  (f ∗ g)[n] = Σ_{m=0}^{M−2} f[m] g[n − m],   n = 0, 1, …, M−2
where all indexes on the sequences f[i] and g[i] are zero-based and are taken modulo M−1.

We produce new sequences f′[j] and g′[j] of length M′, where M′ ≥ 2M − 3. This condition on M′ is required to avoid "interference" between unrelated elements of the original sequences due to the wrap-around of the cyclic convolution that we're going to compute. In our FFT application, we will choose M′ to be a power of two, but any value works as long as it is large enough to satisfy this condition. The sequence f′[j] is constructed by inserting M′ − M + 1 zeroes between the zeroth and first elements of f[i]. The sequence g′[j] is constructed by cyclically repeating the values of g[i] to give a sequence of total length M′. If we now convolve the sequences f′[j] and g′[j], we find that the result contains the cyclic convolution of the original sequences f[i] and g[i] as its first M−1 elements.

A minimal example gives a feeling for why this works. Suppose that we set M=4 (so the sequences are of length 3) and consider


The cyclic convolution of these sequences is found as:


The condition on M′ is that M′ ≥ 2M − 3 = 5. Putting M′ = 5, the new sequences are then


and the first three elements of the cyclic convolution of these new sequences are:


We see that these are identical to the values found from the original sequences f[i] and g[i].
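The same check can be run on concrete numbers. Here's a sketch with a direct (list-based) cyclic convolution and the padding scheme described above; the example sequences are our own, chosen just for illustration:

```haskell
-- Direct cyclic convolution of two equal-length sequences.
cyclicConv :: Num a => [a] -> [a] -> [a]
cyclicConv f g =
  [sum [f !! m * g !! ((n - m) `mod` len) | m <- [0 .. len - 1]]
  | n <- [0 .. len - 1]]
  where len = length f

-- Pad f with zeroes after its first element, and repeat g cyclically,
-- both to the target length m'.
padRader :: Num a => Int -> [a] -> [a] -> ([a], [a])
padRader m' f g = (head f : replicate (m' - length f) 0 ++ tail f,
                   take m' (cycle g))
```

With f = [1,2,3] and g = [4,5,6] (length M−1 = 3, so M′ ≥ 5), the first three entries of the padded length-5 convolution agree with the direct length-3 cyclic convolution.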

To make use of this result in Rader's algorithm, we just choose the smallest power of two that satisfies the length condition on the sequences to be convolved, pad the input sequence f[i] with zeroes, and convolve with the cyclically repeated ω_p sequence. (Most of the details can be pre-computed as part of the FFT "planning" phase.)

Implementing Rader's algorithm

In this section, we’ll look at a basic Haskell implementation of Rader’s algorithm. The implementation here is structured more to be understandable than to be particularly efficient. We’ll reorganise the code to pre-compute some things for efficiency when we start optimising the overall FFT code later on.

After describing the details of the main algorithm, we'll look at the approach used for the calculation of primitive roots of p to determine generators of the group ℤ_p^×, since this requires a little explanation. Finally, we'll look at some test results. Here's the code for our Haskell implementation (we'll refer to the named definitions in what follows):

-- | FFT and inverse FFT drivers for Rader's algorithm.
raderFFT, raderIFFT :: VCD -> VCD
raderFFT = raderFFT' 1 1
raderIFFT v = raderFFT' (-1) (1.0 / (fromIntegral $ length v)) v

-- | Rader's algorithm for prime-length complex-to-complex Fast
-- Fourier Transform.
raderFFT' :: Int -> Double -> VCD -> VCD
raderFFT' sign scale xs
  | isPrime p = map ((scale :+ 0) *) $ generate p $ \idx -> case idx of
    0 -> sum xs
    _ -> xs ! 0 + convmap M.! idx
  | otherwise = error "non-prime length in raderFFT"
  where p = length xs
        p1 = p - 1
        -- ^ Convolution length.
        p1pad = if p1 == 2^(log2 p1)
                then p1
                else 2 ^ (1 + log2 (2 * p1 - 3))
        -- ^ Convolution length padded to next greater power of two.
        g = primitiveRoot p
        -- ^ Group generator.
        ig = invModN p g
        -- ^ Inverse group generator.
        as = backpermute xs $ iterateN p1 (\n -> (g * n) `mod` p) 1
        -- ^ Input values permuted according to group generator
        -- indexing.
        pad = p1pad - p1
        -- ^ Padding size.
        apad = generate p1pad $
               \idx -> if idx == 0 then as ! 0
                       else if idx > pad then as ! (idx - pad) else 0
        -- ^ Permuted input vector padded to next greater power of two
        -- size for fast convolution.
        iidxs = iterateN p1 (\n -> (ig * n) `mod` p) 1
        -- ^ Index vector based on inverse group generator ordering.
        w = omega p
        bs = backpermute (map ((w ^^) . (sign *)) $ enumFromTo 0 p1) iidxs
        -- ^ Root of unity powers based on inverse group generator
        -- indexing.
        bpad = generate p1pad (\idx -> bs ! (idx `mod` p1))
        -- ^ Root of unity powers cyclically repeated to make vector
        -- of next power of two length for fast convolution.
        conv = ifft $ zipWith (*) (fft apad) (fft bpad)
        -- ^ FFT-based convolution calculation.
        convmap = M.fromList $ toList $ zip iidxs conv
        -- ^ Map constructed to enable inversion of inverse group
        -- generator index ordering for output.

Overall driver

As for the earlier mixed-radix FFT, we have top-level functions to perform forward and inverse FFTs (raderFFT and raderIFFT), both of which call a common driver function (raderFFT'), passing in sign and scale parameters to control the direction and final output scaling of the transform.

The final generation of the transformed output (the generate call in raderFFT') applies the appropriate output scaling to a vector constructed from the sum of the input elements (index 0, corresponding to H_0 in (2)) and (for other indexes) a DC offset (h_0 in (2)) plus the appropriate element of the generator-based convolution. In order to deal with the generator-based indexing of the H values in (3), the convolution results are put into a map (convmap) from where they can easily be extracted in output index order.

Index organisation

The determination of the indexes needed for the generator-based indexing scheme in (3) is done by:

  1. Calculating the generator g for the group ℤ_p^×, where p is the prime input vector length (g = primitiveRoot p: see below for details);
  2. Permuting the input vector elements according to powers of the group generator (as: this is the computation of the sequence a_q for the convolution, using the indexing defined in (3));
  3. Calculating the inverse element of the generator in ℤ_p^× (ig: the invModN function finds the multiplicative inverse modulo N of a given integer argument);
  4. Calculating the inverse-generator-based indexes needed for permuting both the powers of ω_p and the output elements (iidxs).

The iidxs index vector is used to permute the powers of ω_p (bs), producing the b_q sequence used in the convolution of (3), and is also used (convmap) to build the index map used to extract result values from the result of the convolution.

Padding to power-of-two length

If p−1 is a power of two, no padding of the a_q and b_q sequences is needed for efficient calculation of the convolution (3). In all other cases, both sequences need to be padded to a suitable power-of-two length. Calculation of the padded length (p1pad) takes account of the minimum size requirement on the padded inputs to the convolution to avoid "wrap-around" effects, and this length is used to control the generation of padded versions of a_q (apad) and b_q (bpad). (The "padded" version of b_q is just a cyclic repeat of the values in bs.)


Once suitably padded versions of the a_q and b_q sequences are computed, the actual convolution step is simple (conv). Both sequences are Fourier transformed (recall that the padded sequence lengths are a power of two, so this is an efficient computation), multiplied together pointwise, and inverse transformed. The convolution of the original unpadded sequences is then found in the first p−1 entries of this result; this is what is picked out by the indexing defined by convmap.

Primitive root calculation

An important step in Rader's prime-length FFT algorithm is the determination of a generator of the group ℤ_p^×. For this group, a group generator is called a primitive root modulo p.

There is no simple general formula to compute primitive roots modulo n, but there are methods to determine a primitive root faster than simply trying out all candidate values. In our case, we're only ever going to be dealing with primitive roots of prime p, which simplifies things a little. We proceed by first calculating φ(p), where φ is Euler's totient function. For prime p, φ(p) = p−1. We then determine the distinct prime factors of φ(p), which we'll call p_1, p_2, …, p_k. Then, for each m ∈ ℤ_p^×, we compute

  m^{φ(p)/p_i} mod p   for i = 1, …, k

using a fast algorithm for modular exponentiation (we use exponentiation by squaring). A number m for which these k values are all different from 1 is a primitive root.
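This criterion is easy to check numerically. Here's a self-contained sketch (the helper names are our own; the article's primitiveRoot and expt play the same roles):

```haskell
import Data.List (nub)

-- Modular exponentiation by squaring.
powMod :: Integer -> Integer -> Integer -> Integer
powMod _ 0 _ = 1
powMod b e m
  | even e    = let h = powMod b (e `div` 2) m in h * h `mod` m
  | otherwise = b * powMod b (e - 1) m `mod` m

-- Distinct prime factors by trial division.
primeFactors :: Integer -> [Integer]
primeFactors n = nub (go n 2)
  where go 1 _ = []
        go k d | d * d > k      = [k]
               | k `mod` d == 0 = d : go (k `div` d) d
               | otherwise      = go k (d + 1)

-- m is a primitive root mod prime p iff m^((p-1)/q) /= 1 (mod p)
-- for every distinct prime factor q of p-1.
isPrimitiveRoot :: Integer -> Integer -> Bool
isPrimitiveRoot p m = all (\q -> powMod m ((p - 1) `div` q) p /= 1)
                          (primeFactors (p - 1))
```

For p = 7, the distinct prime factors of φ(7) = 6 are 2 and 3; m = 3 passes both checks, while m = 2 fails (2^3 ≡ 1 mod 7), matching the generators found earlier.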

The implementation is pretty much just a straightforward transcription into Haskell of the description above. In cases where there are multiple primitive roots (the number of primitive roots modulo n, if there are any, is φ(φ(n))), we just take the first one:

primitiveRoot :: Int -> Int
primitiveRoot p
  | isPrime p =
    let tot = p - 1
        -- ^ Euler's totient function for prime values.
        totpows = map (tot `div`) $ fromList $ nub $ toList $ factors tot
        -- ^ Powers to check.
        check n = all (/=1) $ map (expt p n) totpows
        -- ^ All powers are different from 1 => primitive root.
    in fromIntegral $ head $ dropWhile (not . check) $ fromList [1..p-1]
  | otherwise = error "Attempt to take primitive root of non-prime value"

-- | Fast exponentiation modulo n by squaring.
expt :: Int -> Int -> Int -> Int
expt n b pow = fromIntegral $ go pow
  where bb = fromIntegral b
        nb = fromIntegral n
        go :: Int -> Integer
        go p
          | p == 0 = 1
          | p `mod` 2 == 1 = (bb * go (p - 1)) `mod` nb
          | otherwise = let h = go (p `div` 2) in (h * h) `mod` nb

We also have some support code for QuickCheck testing of the primitive root algorithm:

-- | QuickCheck generator for prime values.  Inefficient...
newtype Prime a = Prime { getPrime :: a }
                deriving (Eq, Ord, Show, Read, Num, Integral, Real, Enum)
instance (Integral a, Ord a, Arbitrary a) => Arbitrary (Prime a) where
  arbitrary = (Prime . (\n -> P.head $ P.dropWhile (< n) primes)) `fmap`
              (arbitrary `suchThat` (> 1))

-- Test code for primitive root determination.

prop_primitive_root ((Prime n) :: Prime Int) =
  primitiveRootTest n $ primitiveRoot n

primitiveRootTest :: Int -> Int -> Bool
primitiveRootTest p root
  | isPrime p = (sort $ toList $ pows) == [1..p-1]
  | otherwise = error "Attempt to take primitive root of non-prime value"
  where pows = generate (p - 1) (expt p root)

We need to generate prime numbers to test the primitive root calculation, and it's much quicker to write a separate QuickCheck generator, using a specialised Arbitrary instance for a custom Prime newtype, than to generate positive integers and filter out non-primes using QuickCheck's ==> filtering operator. We can collect some evidence that the algorithm works by doing something like this in GHCi:

> :load Prime-FFT.hs
[1 of 1] Compiling PrimeFFT          ( Prime-FFT.hs, interpreted )
Ok, modules loaded: PrimeFFT.
> verboseCheckWith (stdArgs { maxSize=25 }) prop_primitive_root
Prime {getPrime = 3}
Prime {getPrime = 7}
Prime {getPrime = 14771}
Prime {getPrime = 6691}
Prime {getPrime = 23753}
+++ OK, passed 100 tests.

Testing of Rader’s algorithm

We can use the same approach for testing our implementation of Rader's algorithm that we used for testing the mixed-radix FFT algorithm, i.e. comparing Rader FFT results to results of the basic DFT algorithm, and testing FFT/IFFT round-trip calculations. The only slight wrinkle here is that we need to test input vectors of prime length only, so we need a specialised Arbitrary instance to generate these:

instance Arbitrary VCD where
  arbitrary = do
    len <- elements $ P.takeWhile (< 500) primes
    fromList <$> vectorOf len arbitrary

With this, we can do some testing. In GHCi:

> :load Prime-FFT.hs
[1 of 1] Compiling PrimeFFT           ( Prime-FFT.hs, interpreted )
Ok, modules loaded: PrimeFFT.
> quickCheck prop_dft_vs_fft
+++ OK, passed 100 tests.
> quickCheck prop_ifft
+++ OK, passed 100 tests.

Nice. It works. There are definitely improvements that we can make to the code, which we’ll do when we start optimising the overall FFT calculation, but the algorithm is complicated enough that it’s nice just to have a working version to start from!

Site content copyright © 2011-2013 Ian Ross       Powered by Hakyll