
Napa-FFT3: Overview

Napa-FFT3 is a complete rewrite of Napa-FFT (version 2 is an aborted experiment). The goal is still the same: to provide, via a mixture of cache-friendly algorithms and code generation, FFT routines in Common Lisp that offer performance comparable to the state of the art. In that regard, it is a success: depending on how it's used, Napa-FFT3 is, at most, around three times as slow as FFTW on small or medium inputs, and can be faster than FFTW for large inputs. The complete picture is more complicated than this; see the Performance section for details.

The goal of Napa-FFT3 isn't only to provide Discrete Fourier Transform (DFT) routines, but, more importantly, to provide building blocks to express common operations that involve DFTs: filtering, convolutions, etc. This is what enables Napa-FFT to achieve such high performance without optimizing at the assembly level. The Easy Interface section should suffice for most developers; the Low-level Interface is described in another section, and may be of interest to some.

Napa-FFT3 also expressly supports FFTs on real data and inverse FFTs back to real data. The Real Interface section describes the facility, and is used in conjunction with the Easy Interface.

Finally, see the Installation section for installation instructions, and the Implementation section for all the gory details.

Note that Napa-FFT3 currently only supports power-of-two-sized inputs; even if/when it gains code for arbitrary sizes, powers of two will most likely remain much more efficient, in terms of both runtime and space usage.
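
Inputs of other lengths must therefore be zero-padded up to a power of two. A minimal sketch of a helper to compute the padded length (this helper is not part of Napa-FFT3; it is shown only for illustration):

```lisp
;; NOT part of Napa-FFT3: round a length up to the next power of two,
;; e.g. to decide how much zero-padding an input needs.
(defun next-power-of-two (n)
  (if (<= n 1)
      1
      (ash 1 (integer-length (1- n)))))
```

For example, (next-power-of-two 60) evaluates to 64, so a 60-sample input would be padded with 4 zeros.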

To recapitulate:

* Installation: how to obtain and load Napa-FFT3;
* Easy Interface: the all-in-one fft, ifft and bit-reverse functions;
* Real Interface: FFTs on real (not complex) data;
* Low-level Interface: direct access to the code generators;
* Implementation: the gory details.

Installation

Napa-FFT3 is a regular ASDF system defined in napa-fft3.asd. If Quicklisp is installed, it suffices to copy the Napa-FFT3 directory under ~/quicklisp/local-projects.

Once registered with ASDF, Napa-FFT3 can be loaded by executing (asdf:oos 'asdf:load-op "napa-fft3"), or, with Quicklisp, (ql:quickload "napa-fft3").

Easy Interface

The core of the “easy” interface consists of:

* fft: computes the DFT of a vector;
* ifft: computes the inverse DFT;
* bit-reverse: converts vectors to or from bit-reversed order.

Syntax: fft vec &key dst size in-order scale window => vector.

Arguments and Values:

* vec: sequence of samples.
* dst: nil, or a simple array of complex doubles. Destructively reused. May be vec itself for an in-place transform.
* size: size of the transform. Defaults to (length vec). Must be a power of two.
* in-order: whether the result should be in order rather than bit-reversed. Defaults to t.
* scale: how the result should be scaled: not at all (nil, the default), by 1/sqrt(size) (:sqrt), or by 1/size (t or :inv).
* window: nil, or a window vector to multiply element-wise with vec.
* vector: a simple array of complex doubles. dst if not nil, otherwise a newly-allocated array.

FFT computes the DFT of the first size values in vec.

First, vec is converted to a simple array of complex samples if necessary. The result is stored in dst, or a fresh array of complex doubles. dst may be the same object as vec for an in-place transform.

If window is non-nil, each value in vec is multiplied by the corresponding value in window during the transform; similarly, the values are scaled according to the value of scale.

If in-order is true, the result is then converted to be in order, which can take more than half as much time as the FFT itself.


CL-USER> (napa-fft:fft '(0 1 2 3 4 5 6 7))
#(#C(28.0d0 0.0d0) #C(-4.0d0 9.65685424949238d0) #C(-4.0d0 4.0d0)
  #C(-4.0d0 1.6568542494923806d0) #C(-4.0d0 0.0d0)
  #C(-4.0d0 -1.6568542494923806d0) #C(-4.0d0 -4.0d0)
  #C(-4.0d0 -9.65685424949238d0))

;; the same, but bit reversed
CL-USER> (napa-fft:fft '(0 1 2 3 4 5 6 7) :in-order nil)
#(#C(28.0d0 0.0d0) #C(-4.0d0 0.0d0) #C(-4.0d0 4.0d0) #C(-4.0d0 -4.0d0)
  #C(-4.0d0 9.65685424949238d0) #C(-4.0d0 -1.6568542494923806d0)
  #C(-4.0d0 1.6568542494923806d0) #C(-4.0d0 -9.65685424949238d0))

;; :scale nil is the default
CL-USER> (napa-fft:fft '(0 1 2 3) :scale nil)
#(#C(6.0d0 0.0d0) #C(-2.0d0 2.0d0) #C(-2.0d0 0.0d0) #C(-2.0d0 -2.0d0))

;; the same, but scaled by 1/4
CL-USER> (napa-fft:fft '(0 1 2 3) :scale t)
#(#C(1.5d0 0.0d0) #C(-0.5d0 0.5d0) #C(-0.5d0 0.0d0) #C(-0.5d0 -0.5d0))

;; again, scaled by 1/sqrt(4) = 1/2
CL-USER> (napa-fft:fft '(0 1 2 3 5 6 7 8) :size 4 :scale :sqrt)
#(#C(3.0d0 0.0d0) #C(-1.0d0 1.0d0) #C(-1.0d0 0.0d0) #C(-1.0d0 -1.0d0))


Syntax: ifft vec &key dst size in-order scale window => vector.

Arguments and Values:

* vec: sequence of complex Fourier coefficients.
* dst: nil, or a simple array of complex doubles. Destructively reused. May be vec itself for an in-place transform.
* size: size of the inverse transform. Defaults to (length vec). Must be a power of two.
* in-order: whether vec is in order rather than bit-reversed. Defaults to t.
* scale: how the result should be scaled: by 1/size (default, t or :inv), by 1/sqrt(size) (:sqrt), or not at all (nil).
* window: nil, or a (bit-reversed) window vector to multiply element-wise with vec.
* vector: a simple array of complex doubles. dst if not nil, otherwise a newly-allocated array.

IFFT computes the inverse DFT of the first size Fourier coefficients in vec.

First, vec is converted to a simple array of complex samples if necessary. The result is stored in dst, or a fresh array of complex doubles. dst may be the same object as vec for an in-place transform.

If in-order is true, vec is first permuted into bit-reversed order, which can take more than half as much time as the inverse FFT itself.

If window is non-nil, each value in vec is multiplied by the corresponding value in window during the transform; similarly, the values are scaled according to scale. Note that this happens after the bit-reversal. window should thus be bit-reversed itself. Since this corresponds to a convolution, this is usually easily satisfied.


;; the defaults ensure fft and ifft are inverses
CL-USER> (napa-fft:ifft (napa-fft:fft '(0 1 2 3)))
#(#C(0.0d0 0.0d0) #C(1.0d0 0.0d0) #C(2.0d0 0.0d0) #C(3.0d0 0.0d0))

;; Skipping both bit-reversals saves a lot of time, without
;; changing the result; similarly, scaling can happen at any
;; step.
CL-USER> (napa-fft:ifft (napa-fft:fft '(0 1 2 3) :in-order nil
                                      :scale :sqrt)
                        :in-order nil
                        :scale :sqrt)
#(#C(0.0d0 0.0d0) #C(1.0d0 0.0d0) #C(2.0d0 0.0d0) #C(3.0d0 0.0d0))

;; The :window argument performs convolutions; simply make
;; sure that :in-order is nil when transforming the convolved
;; vector.  Here, the input is shifted one element to the
;; right and scaled by 1/2.
CL-USER> (napa-fft:ifft (napa-fft:fft '(0 1 2 3))
                        :window (napa-fft:fft '(0 1/2 0 0)
                                              :in-order nil))
#(#C(1.5d0 0.0d0) #C(0.0d0 0.0d0) #C(0.5d0 0.0d0) #C(1.0d0 0.0d0))


Syntax: bit-reverse vec &optional dst size => vector.

Arguments and values:

* vec: the vector of values to permute.
* dst: nil, or the destination vector. May be vec itself for an in-place reversal.
* size: number of elements to permute. Defaults to (length vec). Must be a power of two.
* vector: the permuted vector. dst if provided, otherwise a newly-allocated array.

BIT-REVERSE permutes the first size elements in vec to bit-reversed indices, storing the result in dst if provided. vec may be the same as dst for an in-place reversal.

It should usually not be necessary to bit-reverse values explicitly, but it may still be useful to convert an in-order vector to out-of-order or vice-versa.


CL-USER> (napa-fft:bit-reverse (coerce '(0d0 1d0 2d0 3d0)
                                       'napa-fft:real-sample-array))
#(0.0d0 2.0d0 1.0d0 3.0d0)

;; bit-reverse is its own inverse.
CL-USER> (napa-fft:bit-reverse * *)
#(0.0d0 1.0d0 2.0d0 3.0d0)


TODO. It's the same as Bordeaux FFT.


Real Interface

The real interface offers three functions specialized to operate on real (not complex) data: rfft, rifft and %2rfft.

These are convenient because the result is a vector of real values; they also offer strong performance improvements (almost halving computation times) for in-order, out-of-place transforms, at the expense of a little precision.


Syntax: rfft vec &key dst size scale => vector

Arguments and values:

* vec: sequence of real samples.
* dst: nil, or a simple array of complex doubles. Destructively reused.
* size: size of the transform. Defaults to (length vec). Must be a power of two.
* scale: how the result should be scaled: not at all (nil, the default), by 1/sqrt(size) (:sqrt), or by 1/size (t or :inv).
* vector: a simple array of complex doubles. dst if not nil, otherwise a newly-allocated array.

RFFT computes the in-order DFT of the first size samples in vec.

vec is converted, if necessary, to a simple array of doubles. Its DFT is then re-expressed as a half-size DFT of complex samples, and the result is written in dst.


;; rfft should always yield the same result as fft (modulo
;; rounding)
CL-USER> (napa-fft:rfft '(0 1 2 3))
#(#C(6.0d0 0.0d0) #C(-2.0d0 2.0d0) #C(-2.0d0 0.0d0)
  #C(-1.9999999999999998d0 -2.0d0))


Syntax: rifft vec &key dst size scale => vector

Arguments and values:

* vec: sequence of complex Fourier coefficients. Destroyed.
* dst: nil, or a simple vector of doubles. Destructively reused.
* size: size of the inverse transform to perform. Defaults to (length vec). Must be a power of two.
* scale: how the result should be scaled: not at all (nil), by 1/sqrt(size) (:sqrt or sqrt), or by 1/n (default, t or :inv).
* vector: a simple array of doubles. dst if not nil, otherwise a newly-allocated array.

RIFFT computes the in-order inverse DFT of the first size Fourier coefficients in vec.

vec is converted, if necessary, to a simple array of complex doubles. Its inverse DFT is then re-expressed, destructively, as a half-size inverse DFT of complex doubles. The procedure can only work correctly when the output is purely real: even if we're only interested in the real component, it will be distorted by any non-zero imaginary value.


CL-USER> (napa-fft:rifft (napa-fft:rfft '(0 1 2 3)))
#(0.0d0 1.0d0 2.0d0 3.0d0)





Examples

I'm no good at this signal processing stuff; most of what I know about this comes, indirectly, from classical music training.

I'll use sox to play sound files easily, and the following function to convert float samples to s32 files:

(defun emit-raw32-file (file data &optional (max 1d0))
  (with-open-file (s file :direction :output :element-type '(signed-byte 32)
                          :if-exists :supersede)
    (write-sequence (map-into (make-array (length data)
                                          :element-type '(signed-byte 32))
                              (lambda (data)
                                (let ((x (/ (float (realpart data) 1d0)
                                            max)))
                                  (floor (* x (1- (ash 1 31))))))
                              data)
                    s)))

Napa-FFT3 does a fair amount of runtime compilation and cached computations. If an operation is slow (more than one second), it was almost certainly due to a slow one-time operation; try running it again. Also, there's a lot of gratuitous consing, here. Nearly all the operations could be in-place and (nearly) non-consing without any change to the code.

Generate a single tone

The standard frequency of the middle A is 440 Hz nowadays. Sound is commonly generated, on computers, at 44100 Hz. Let's generate a signal of 64k points that, when played back at 44100 Hz, will result in an A440. The following function returns a vector with zeros everywhere except where specified in the first argument, a list designator of indices.

(defun impulse (i n &optional (value 1d0))
  (let ((vec (make-array n :element-type 'napa-fft:complex-sample
                           :initial-element (complex 0d0 0d0))))
    (dolist (i (if (listp i) i (list i)) vec)
      (setf (aref vec i) (complex (float value 1d0))))))

The DFT computes a vector such that each entry corresponds to the amount of energy in the frequency(ies) corresponding to that entry. The i th entry of an N -element Fourier-coefficient vector, with an original sampling frequency of F, corresponds to a signal frequency of iF/N (+/- F/N). So, if we want a wave at 440 Hz in a vector of 64k played back at 44100 Hz, we have to put energy in the (440/44100)*65536 th bin.
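
As a sanity check on the iF/N formula, here is a tiny helper (hypothetical, not part of the library) that maps a frequency to its bin index:

```lisp
;; Hypothetical helper: index of the bin for frequency FREQ, in an
;; N-point transform of a signal sampled at RATE Hz (i = freq*n/rate).
(defun frequency-bin (freq n rate)
  (round (* freq n) rate))
```

(frequency-bin 440 65536 44100) evaluates to 654, the bin used in the transcript below.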

;; create a vector with 1 only in the bin corresponding to 440 Hz,
;; convert back to the time domain, and save the double values as
;; a file of (signed-byte 32) values.
CL-USER> (emit-raw32-file "~/napa-fft3/example/a440.s32"
                          (napa-fft:ifft (impulse (round (* 440 65536) 44100)
                                                  65536)
                                         :scale nil))

$ play -r44100 a440.s32 # play is a sox command; play a file of
                        # signed 32 bit samples at 44100 Hz.
                        # Should sound like an A440!

Generate a chord

We (westerners) are mostly used to a scale system based on 12 equispaced (not quite, but close enough) semi-tones. Reality tends to enforce that the scale covers a range of frequencies from F to 2F. For example, while the middle A is at 440 Hz, the one in the next octave is at 880 Hz. The middle C is 9 half-tones lower than the middle A, at (* 440 (expt .5 (/ 9 12))), around 262 Hz (264 Hz in the real world). E is then 2 tones higher, at (* 262 (expt 2 (/ 4 12))) ~= 330 Hz, and G a tone and a half higher again, (* 330 (expt 2 (/ 3 12))) ~= 392 Hz.
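
The semi-tone arithmetic above can be wrapped in a small helper (hypothetical, shown only for illustration) for equal temperament:

```lisp
;; Hypothetical helper: frequency K semitones away from REFERENCE, in
;; equal temperament (each semitone multiplies the frequency by 2^(1/12)).
(defun semitones-from (reference k)
  (* reference (expt 2d0 (/ k 12))))
```

(semitones-from 440 -9) is roughly 261.6 Hz (the middle C), and (semitones-from 440 12) is exactly 880 Hz, one octave up.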

Thus, to hear the boring middle C/E/G chord, we need energy at 262, 330 and 392 Hz; note how the energy is 1/3 at each point, so that the total comes to 1.

CL-USER> (emit-raw32-file "~/napa-fft3/example/chord.s32"
                          (napa-fft:ifft (impulse (list (round (* 262 65536) 44100)
                                                        (round (* 330 65536) 44100)
                                                        (round (* 392 65536) 44100))
                                                  65536
                                                  (/ 3d0))
                                         :scale nil))

$ play -r44100 chord.s32 # you should recognize this sound

Filter frequencies out

First, let's save our time-domain chord signal:

CL-USER> (defparameter *chord* (napa-fft:ifft
                                (impulse (list (round (* 262 65536) 44100)
                                               (round (* 330 65536) 44100)
                                               (round (* 392 65536) 44100))
                                         65536
                                         (/ 3d0))
                                :scale nil))

Here's a function to generate random noise and another to average two vectors:

(defun noise (n &optional (range .5d0))
  (let ((2range (* 2 range)))
    (map-into (make-array n :element-type 'napa-fft:complex-sample)
              (lambda ()
                (complex (- (random 2range) range))))))

(defun m+ (x y &optional (scale .5d0))
  (map 'napa-fft:complex-sample-array
       (lambda (x y)
         (* scale (+ x y)))
       x y))

We can noise our signal up by adding noise to the chord:

CL-USER> (defparameter *noisy-chord* (m+ *chord* (noise 65536)))
CL-USER> (emit-raw32-file "~/napa-fft3/example/noised-chord.s32" *noisy-chord*)

$ play -r44100 noised-chord.s32 # still recognizable, but
                                # annoying.

Let's say that we know that the only interesting stuff is in the central octave. We could zero out all the frequencies outside the 262-524 Hz range.

First, we have to convert the noisy signals in the frequency domain:

CL-USER> (defparameter *noisy-chord-freq* (napa-fft:fft *noisy-chord*))

Then, we want to replace everything outside (round (* 262 65536) 44100) and (round (* 524 65536) 44100) with 0:

CL-USER> (defparameter *octave-chord-freq* (copy-seq *noisy-chord-freq*))
CL-USER> (prog1 nil
           (fill *octave-chord-freq* (complex 0d0)
                 :end (round (* 262 65536) 44100))
           (fill *octave-chord-freq* (complex 0d0)
                 :start (1+ (round (* 524 65536) 44100))))

Now, we can convert back in the time domain and listen to the result:

CL-USER> (emit-raw32-file "~/napa-fft3/example/octave-chord.s32"
                          (napa-fft:ifft *octave-chord-freq*))

Much better.

Filter frequencies out in a real signal

Note that, as is often the case, we know that our signal is real (has no imaginary component). In this case, we can use real-only FFT/IFFT instead:

CL-USER> (defparameter *noisy-chord-freq*
           (napa-fft:rfft (map 'napa-fft:real-sample-array
                               #'realpart *noisy-chord*)))
CL-USER> (defparameter *octave-chord-freq* (copy-seq *noisy-chord-freq*))
CL-USER> (prog1 nil
           (fill *octave-chord-freq* (complex 0d0)
                 :end (round (* 262 65536) 44100))
           (fill *octave-chord-freq* (complex 0d0)
                 :start (1+ (round (* 524 65536) 44100))))
CL-USER> (emit-raw32-file "~/napa-fft3/example/octave-chord.s32"
                          (napa-fft:rifft *octave-chord-freq*))

The result is the same, but (once the routines are compiled), each FFT/IFFT is about twice as fast. Mind the fact that rifft is destructive on its input, however. It also only works when the output is purely real; any non-zero imaginary component will warp the real-valued output.

Filter frequencies out during the IFFT

We can do this directly, by filtering during the inverse FFT. Replacing most values with 0 and leaving the rest alone is equivalent to multiplying by 0 or 1 (usually, we want a more gradual dampening, and the filter will also have fractional values).

CL-USER> (defun central-octave-filter (i n)
           (if (<= (round (* 262 n) 44100)
                   i
                   (round (* 524 n) 44100))
               1 0))
CL-USER> (defparameter *filter*
           (napa-fft:window-vector 'central-octave-filter 65536))
CL-USER> (defparameter *filtered-noisy-chord-freq*
           (map 'napa-fft:complex-sample-array #'*
                *filter* *noisy-chord-freq*))
CL-USER> (emit-raw32-file "~/napa-fft3/example/octave-chord2.s32"
                          (napa-fft:ifft *filtered-noisy-chord-freq*))

And we have the same final result. Obviously, an advantage is that we can compute the filter once, and easily apply it directly.

Another advantage is that we can bit-reverse (permute) the filter instead of bit-reversing after the FFT and before the IFFT. Again, the final result is the same, but we only bit-reverse the (cached) filter vector once instead of every frequency-domain vector.

CL-USER> (emit-raw32-file "~/napa-fft3/example/octave-chord3.s32"
                          (napa-fft:ifft
                           (napa-fft:fft *noisy-chord* :in-order nil)
                           :window (napa-fft:bit-reverse
                                    (napa-fft:window-vector 'central-octave-filter
                                                            65536))
                           :in-order nil))

Filter low-amplitude noise out

In the previous example we already knew at what frequency the interesting signal was. That's sometimes the case (e.g. when some of us unconsciously filter out annoyingly-high-pitched voices), but somewhat uncommon.

Instead of applying a constant filter on frequencies, we can attempt to find which frequencies dominate (have a high energy), and eliminate the rest.

Filter by strength

The easiest way is to find the frequency with the most energy, and clamp out everything that's below a fixed fraction (e.g. 1%) of that.

(defun amplitude-clamp-fraction (vector fraction)
  (let ((limit (* (reduce #'max vector :key #'abs)
                  fraction)))
    (map 'napa-fft:complex-sample-array
         (lambda (x)
           (if (< (abs x) limit)
               (complex 0d0)
               x))
         vector)))

This function performs the same operation, regardless of the order in which its data is permuted. We can thus use out-of-order FFTs and IFFTs here as well.

CL-USER> (emit-raw32-file "~/napa-fft3/example/octave-chord4.s32"
                          (napa-fft:ifft
                           (amplitude-clamp-fraction
                            (napa-fft:fft *noisy-chord* :in-order nil)
                            .01)
                           :in-order nil))

When frequencies below 1% of the max energy are zeroed out, the amount of noise is significantly reduced, but there's still an annoying ringing. Increasing the limit to 2% improves matters a lot; however, if we set too high a limit, we'll start to filter out interesting, if somewhat weak, sounds.

Filter by count

We could also know ahead of time that there are only a few (e.g. 3) interesting frequencies in the signal. In that case, we can find the k frequencies with the most energy, and remove everything else:

(defun amplitude-clamp-k (vector k)
  (let* ((n      (length vector))
         (values (make-array n)))
    (dotimes (i n)
      (setf (aref values i) (cons (abs (aref vector i)) i)))
    ;; this would be a heap or a quickselect if I cared
    (sort values #'> :key #'car)
    (let ((result (make-array n
                              :element-type 'napa-fft:complex-sample
                              :initial-element (complex 0d0))))
      (loop repeat k
            for (nil . i) across values
            do (setf (aref result i) (aref vector i))
            finally (return result)))))

Again, the input may be permuted without changing the result, so we can stick to out-of-order transforms:

CL-USER> (emit-raw32-file "~/napa-fft3/example/octave-chord5.s32"
                          (napa-fft:ifft
                           (amplitude-clamp-k
                            (napa-fft:fft *noisy-chord* :in-order nil)
                            3) ; we're cheating here (:
                           :in-order nil))

Windowing chunks of long signals

Let's filter Justin Bieber out of one of his songs. That's actually a very difficult task, especially for example code: voices tend to have rich harmonics, and span a wide range of frequencies. What we can do is filter every frequency higher than a certain limit.

First, I use sox to convert the input (in wave format) to mono s32:

$ sox never.wav -r 44100 -c 1 never.s32

I can now read the first minute and convert it to doubles:

(defun read-raw32-file (file &optional n (max 1d0))
  (with-open-file (s file :element-type '(signed-byte 32))
    (let* ((n   (or n
                    (file-length s)))
           (seq (make-array n :element-type '(signed-byte 32))))
      (read-sequence seq s)
      (let ((scale (* max (expt 2d0 -31))))
        (declare (type double-float scale))
        (map 'napa-fft:real-sample-array
             (lambda (x)
               (* x scale))
             seq)))))

CL-USER> (defparameter *never* (read-raw32-file "~/napa-fft3/example/never.s32"
                                                (* 44100 60)))

We could just transform the whole 60 seconds at once, and revert it, but that'd be a lot of work. Instead, we'll chunk it into short periods, say one tenth of a second, and filter each chunk separately. We'll simplify things and use chunks of 4096 samples (slightly less than 4410 samples).

Some fiddling around lets us discover that Bieber seems to sing around 260 Hz and higher, so we'll cut off frequencies from 260 Hz and up.

CL-USER> (defparameter *jb-filter*
           (napa-fft:window-vector (lambda (i n)
                                     (if (< i (round (* 260 n) 44100))
                                         1 0))
                                   4096))

The following function will apply the same filter to each chunk of samples.

(defun filter-chunks (vector filter)
  (declare (type napa-fft:real-sample-array vector filter))
  (let* ((destination (make-array (length vector)
                                  :element-type 'napa-fft:complex-sample))
         (chunk-size  (length filter))
         (chunk       (make-array chunk-size
                                  :element-type 'napa-fft:complex-sample))
         (filter      (napa-fft:bit-reverse filter)))
    (declare (optimize speed))
    (loop for i below (length vector) by chunk-size
          for end = (min (length vector) (+ i chunk-size))
          do (fill chunk (complex 0d0))
             (replace chunk vector :start2 i :end2 end)
             (napa-fft:fft chunk :dst chunk :in-order nil)
             (napa-fft:ifft chunk
                            :dst chunk :in-order nil
                            :window filter)
             (replace destination chunk :start1 i :end1 end)
          finally (return destination))))

We can then directly filter all the high frequencies in *never*.

CL-USER> (emit-raw32-file "~/napa-fft3/example/never-prime.s32"
                          (filter-chunks *never* *jb-filter*))

The filtering process loses a lot of volume. We can recover that by scaling the output so that its 2-norm is the same. There's also an annoying noise: that's an artefact of the way we cut our signal up and pretend each chunk is periodic. The latter problem is what the windowing support attempts to address.

Instead of filtering each chunk independently, we'll use a triangular window, and overlap the chunks so that each sample is processed exactly twice. The triangular window ensures that the two weights assigned to each sample add up to 1. windowed-fft takes care of applying the triangular window to the input, and a scaling factor is applied so that the energy in the output is half that in the input. The results of the chunks are then added together, with the same overlap as the input. filter-chunks becomes:
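
The new filter-chunks calls an energy helper that this document never defines; a minimal sketch (total energy as the sum of squared magnitudes), assuming that is what was intended:

```lisp
(defun energy (vec)
  ;; sum of squared magnitudes over all (possibly complex) samples
  (reduce #'+ vec
          :key (lambda (x) (expt (abs x) 2))
          :initial-value 0d0))
```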

(defun filter-chunks (vector filter)
  (declare (type napa-fft:real-sample-array vector filter))
  (let* ((destination (make-array (length vector)
                                  :element-type 'napa-fft:complex-sample
                                  :initial-element (complex 0d0)))
         (chunk-size  (length filter))
         (half-size   (truncate chunk-size 2))
         (chunk       (make-array chunk-size
                                  :element-type 'napa-fft:complex-sample))
         (filter      (napa-fft:bit-reverse filter)))
    (declare (optimize speed))
    (loop for i below (length vector) by half-size
          for end = (min (length vector) (+ i chunk-size))
          do (fill chunk (complex 0d0))
             (loop for dst upfrom 0
                   for src from i below end
                   do (setf (aref chunk dst) (complex (aref vector src))))
             (let ((old-energy (energy chunk)))
               (napa-fft:windowed-fft chunk half-size
                                      :window-fn 'napa-fft:triangle
                                      :dst chunk :in-order nil)
               (napa-fft:ifft chunk
                              :dst chunk :in-order nil
                              :window filter)
               (let* ((new-energy (energy chunk))
                      (scale      (* .5d0 (sqrt (/ old-energy new-energy)))))
                 (declare (type double-float scale))
                 (loop for src upfrom 0
                       for dst from i below end
                       do (incf (aref destination dst)
                                (* scale (aref chunk src))))))
          finally (return destination))))

When I play that back, I only hear high quality percussions and little to no artefacts.

Multiplying integers or polynomials via a convolution

So far, we've been using point-wise multiplication in the frequency domain to filter frequencies out. We can also see it as a nice way to execute convolutions. We can use this to implement fast multiplication of polynomials or of integers, with the right encoding. However, please don't use it for bignum multiplications without checking the precision of the transforms.

Let's multiply 1005 by 1234. 1005 is 1*10^3 + 0*10^2 + 0*10^1 + 5*10^0, which we can encode, as a vector: #(1 0 0 5 0 0 0 0) (note the padding at the end), and similarly for 1234. Again, filtering is oblivious to any permutation (as long as the windowing vector is bit-reversed), so we can do everything out of order.

CL-USER> (napa-fft:fft #(1 0 0 5 0 0 0 0) :in-order nil)
#(#C(6.0d0 0.0d0) #C(-4.0d0 0.0d0) #C(1.0d0 5.0d0) #C(1.0d0 -5.0d0)
  #C(-2.5355339059327378d0 -3.5355339059327378d0)
  #C(4.535533905932738d0 3.5355339059327378d0)
  #C(4.535533905932738d0 -3.5355339059327378d0)
  #C(-2.5355339059327378d0 3.5355339059327378d0))
CL-USER> (napa-fft:fft #(1 2 3 4 0 0 0 0) :in-order nil)
#(#C(10.0d0 0.0d0) #C(-2.0d0 0.0d0) #C(-2.0d0 2.0d0) #C(-2.0d0 -2.0d0)
  #C(-0.41421356237309515d0 -7.242640687119286d0)
  #C(2.414213562373095d0 1.2426406871192857d0)
  #C(2.414213562373095d0 -1.2426406871192857d0)
  #C(-0.41421356237309515d0 7.242640687119286d0))
CL-USER> (napa-fft:ifft * :window ** :in-order nil)
#(#C(0.9999999999999982d0 0.0d0) #C(1.9999999999999991d0 0.0d0)
  #C(3.0d0 0.0d0) #C(9.0d0 0.0d0) #C(10.000000000000002d0 0.0d0)
  #C(15.0d0 0.0d0) #C(20.0d0 0.0d0) #C(-8.881784197001252d-16 0.0d0))

If we remove the imaginary portions (which are all 0) and round some numerical errors away, we find #(1 2 3 9 10 15 20 0), this time with only one element of padding; this value represents 1*10^6 + 2*10^5 + 3*10^4 + 9*10^3 + 10*10^2, etc. If we take care of the carries, we find 1240170, which is indeed 1005 * 1234.
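
Taking care of the carries can be sketched as follows (a hypothetical helper, not part of the library): folding the convolution coefficients most-significant-first in base 10 performs exactly the carry propagation.

```lisp
;; Hypothetical helper: interpret a list of base-10 "digits" (which may
;; exceed 9; the fold carries as needed) as an integer, most significant
;; digit first.
(defun digits-to-integer (digits)
  (reduce (lambda (acc d) (+ (* 10 acc) d))
          digits
          :initial-value 0))
```

(digits-to-integer '(1 2 3 9 10 15 20)) evaluates to 1240170, i.e. 1005 * 1234.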

Of course, we can also exploit the fact that the input and output are all reals to use rfft and rifft. We can do even better with %2rfft, which performs two real FFTs at the same time. However, if we do that, we have to use in-order transforms and perform the element-wise multiplication ourselves. It's a trade-off, and even when we're only concerned with computation times, the right answer depends on a lot of variables.

;; the results are returned one after the other in a single
;; vector of complex doubles
CL-USER> (napa-fft:%2rfft '(1 0 0 5 0 0 0 0)
                          '(1 2 3 4 0 0 0 0))
#(#C(6.0d0 0.0d0) #C(-2.5355339059327378d0 -3.5355339059327378d0)
  #C(1.0d0 5.0d0) #C(4.535533905932738d0 -3.535533905932738d0)
  #C(-4.0d0 0.0d0) #C(4.535533905932738d0 3.535533905932738d0)
  #C(1.0d0 -5.0d0) #C(-2.5355339059327378d0 3.5355339059327378d0)
  #C(10.0d0 0.0d0) #C(-0.41421356237309515d0 -7.242640687119286d0)
  #C(-2.0d0 2.0d0) #C(2.414213562373095d0 -1.2426406871192857d0)
  #C(-2.0d0 0.0d0) #C(2.414213562373095d0 1.2426406871192857d0)
  #C(-2.0d0 -2.0d0) #C(-0.41421356237309515d0 7.242640687119286d0))

CL-USER> (let ((x (subseq * 0 8))
               (y (subseq * 8)))
           (napa-fft:rifft (map-into x #'* x y)))
#(0.9999999999999991d0 2.0d0 2.9999999999999964d0 9.0d0 10.0d0 15.0d0
  20.000000000000004d0 -8.881784197001252d-16)

Low-level Interface

These functions directly expose the runtime code generator to let you avoid all the argument-list parsing overhead in the regular interface, and hoist the lookups outside performance-critical code. All the generators are memoised in specials, and accesses are protected by mutexes (and are atomic on SBCL anyway), so there should not be any threading issue.

The low-level interface consists of the following three generators, get-fft, get-windowed-fft and get-reverse; they don't offer any functionality absent from the easy interface, but make it possible to dispatch once and subsequently call the right function directly.
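
For instance, the dispatch can be hoisted out of a hot loop; a sketch, where the buffer setup is invented for illustration:

```lisp
;; Dispatch once: look the transform function up outside the loop, then
;; call it directly on each buffer (each buffer is transformed in-place).
(let ((fft (napa-fft:get-fft 8 :in-order nil))
      (buffers (loop repeat 4
                     collect (make-array 8
                                         :element-type 'napa-fft:complex-sample
                                         :initial-element (complex 1d0 0d0)))))
  (dolist (buffer buffers buffers)
    (funcall fft buffer)))
```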

The previous three generators actually depend on the memoised generators %ensure-fft, %ensure-twiddles and %ensure-reverse, which allow even lower-level accesses.

The functions returned by these generators perform virtually no error checking; make sure to use them correctly.


Syntax: get-fft size &key forward scale in-order => fft-function.

Arguments and Values:

GET-FFT returns a function that computes the DFT of the first size elements of its single argument. That argument must be a complex-sample-array, and will be transformed in-place.


CL-USER> (let ((fft (napa-fft:get-fft 8))
               (data (make-array 8 :element-type 'napa-fft:complex-sample
                                   :initial-element (complex 1d0 0d0))))
           ;; data transformed in-place
           (funcall fft data))
#(#C(8.0d0 0.0d0) #C(0.0d0 0.0d0) #C(0.0d0 0.0d0) #C(0.0d0 0.0d0)
  #C(0.0d0 0.0d0) #C(0.0d0 0.0d0) #C(0.0d0 0.0d0) #C(0.0d0 0.0d0))


Syntax: get-windowed-fft size window-type &key forward scale in-order => fft-function

Arguments and Values:

GET-WINDOWED-FFT returns a function that computes the DFT of the first size elements of its first argument. Conceptually, this first argument is first multiplied element-wise by the second argument (the window); in practice, this is executed as part of the FFT.

For inverse FFTs, the multiplication happens after the bit-reversal; the window should thus itself be bit-reversed.


Syntax: get-reverse n &optional eltype => reversal-function

Arguments and Values:

GET-REVERSE returns a function that performs an in-place bit-reversal of the first n elements of its single argument.


Syntax: %ensure-fft direction scaling windowing n => fft-function

Arguments and Values:

%ENSURE-FFT is similar to GET-FFT or GET-WINDOWED-FFT, except that the start indices may be specified.


Syntax: %ensure-twiddles n forwardp => twiddle-vector

Arguments and Values:

%ENSURE-TWIDDLES computes and caches vectors that should be passed as the last argument of the function returned by %ENSURE-FFT. The structure of the vectors is such that a vector appropriate for a transform of size n is also appropriate for all smaller sizes (with the same direction).


Syntax: %ensure-reverse n &optional eltype => reversal-function

Arguments and Values:

%ENSURE-REVERSE is exactly like GET-REVERSE, but allows the user to specify the starting index of the vector to bit-reverse.


Implementation

Napa-FFT3 is based on a split-radix, out-of-order and in-place variant of the Cooley-Tukey FFT algorithm. Split-radix is relatively simple, and achieves operation counts very close (within a couple percent) to the minimum known so far. Doing it out of order simplifies the transform code a lot; this way, all the accesses are naturally in-place and follow a streaming order, at each level of the recursion. Better: by executing the recursion depth-first rather than breadth-first, the code exploits caches implicitly. Finally, an in-place transform can hope to fit nearly twice as large inputs in cache as an out-of-place one, as there is no auxiliary output vector. All in all, it looks like a good choice of algorithm: close enough to the theoretical optimum, interesting performance properties, and not too complex to implement.

Many operations are as easily expressed on bit-reversed as on natural-order frequency-domain values (e.g. convolutions, or filtering noise out). That's not always the case, unfortunately, so Napa-FFT also implements a bit-reversal pass.

In the past, this was often slow enough to make it vastly preferable to instead implement an in-order (autosorting) FFT: the slow, bandwidth-bound bit-reversal is merged with the more arithmetic-heavy FFT, hopefully resulting in faster code than executing each serially. However, as [Karp] points out, this seems to be better explained by bad code than anything else. [Carter and Gatlin] build on that and describe an algorithm designed to exploit memory caches, ensuring at most two misses per cache line of data; this is enough to obtain performance much better than (or comparable to) all the algorithms reviewed in [Karp] across a range of nearly-contemporary machines. In a later paper, [Zhang and Zhang] note that we can exploit the high associativity of certain caches to simplify the code a lot, or improve on its performance. Their algorithms are somewhat complicated by explicit blocking loops; a clever recursion suffices to obtain access patterns appropriate for nearly all block sizes.

Split-radix FFT

The split-radix algorithm has a very short recursive definition; in practice, however, we want larger base cases than the strict minimum. Rather than depending on hand-written specialised base cases, Napa-FFT3 includes a specialised compiler that turns the execution trace of a given FFT into straight-line code. This way, we maintain the near-optimal operation count, especially since the code to multiply by twiddle factors can be optimised for “nice” constants (e.g. 1, i or sqrt(i)). The specialised compiler takes care of spilling values to the data vector according to Belady's algorithm. This tends to perform a lot better than general-purpose spilling logic (e.g. SBCL's coloring-based algorithm), and spilled values are stored at correctly-aligned addresses, following a cache-friendly layout.
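The savings from “nice” twiddle factors can be made concrete. Multiplying by i needs no real multiplications at all, and multiplying by sqrt(i) = (1 + i)/sqrt(2) needs only two instead of the generic four. A Python illustration (hypothetical helper names, not Napa-FFT3 code; the conjugate constants used by a forward transform are handled symmetrically):

```python
import math

def mul_by_i(re, im):
    # (re + i*im) * i = -im + i*re: a swap and a negation, no multiplies.
    return (-im, re)

def mul_by_sqrt_i(re, im):
    # sqrt(i) = (1 + i)/sqrt(2), so
    # (re + i*im) * sqrt(i) = ((re - im) + i*(re + im)) / sqrt(2):
    # two real multiplications (both by 1/sqrt(2)) and two additions.
    s = 1.0 / math.sqrt(2.0)
    return (s * (re - im), s * (re + im))
```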

This, along with specialised recursive steps for each size, gives us high-performance out-of-order transforms. Much of the complexity in FFTs comes from ensuring the data are in natural order; it's not surprising that simple code can achieve runtimes comparable to or lower than those of sophisticated code like FFTW once that constraint is relaxed.

The base cases are nevertheless incredibly naive, compared to FFTW's codelets. For tiny transforms, Napa-FFT is clearly not in the same league; however, as cache effects gain importance, the higher-level design choices pay off, and out-of-order Napa-FFT closes the gap with FFTW. In fact, it is even slightly faster for very large transforms.

Bit reversal

More surprising might be the fact that quick bit-reversals are now so easy to design. Much of the runtime in Napa-FFT2 was spent in the transposition steps, and, at first sight, a bit-reversal is even more complicated. However, bit-reversals have the nice property that they consist only of swaps: they can easily be executed in place, by swapping each element with its destination. In contrast, transposition only has this property for square matrices (FFT sizes that are even powers of two).
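A minimal Python sketch of the swap-based in-place bit reversal, in the naive index order (without the cache-friendly traversal discussed below):

```python
def bit_reverse_inplace(v):
    """Permute v (power-of-two length) into bit-reversed order in place.
    The permutation consists only of fixed points and 2-cycles, so
    swapping once whenever i < reverse(i) suffices."""
    n = len(v)
    bits = n.bit_length() - 1
    for i in range(n):
        j = int(format(i, '0%db' % bits)[::-1], 2)  # reverse i's bits
        if i < j:
            v[i], v[j] = v[j], v[i]
    return v
```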

Historically, on cache-ful computers, the issue with bit-reversal has been one of aliasing in low-associativity caches: the swap pattern involves addresses that tend to be mapped to the same cache lines. The workarounds involve some sort of software buffering, either in registers or in an auxiliary array, to supplement the caches.

Nowadays, however, computer architects have more transistors than they know what to do with, so caches are large and highly associative (at least 4-way at the L2 or L3 level on both AMD and Intel chips). Since we're mostly concerned with bit-reversing vectors of complex samples, each element is a pair of double floats, and only 4 fit in each cache line. In this case, as [Zhang and Zhang 99] point out, the associativity is high enough not to necessitate any software buffering!

The novel part seems to be the traversal order. [Zhang and Zhang 99] point out that blocking helps with locality, and a few older works [Elster, Rutkowska] have described how bit-reversal swaps can be generated recursively. The blocking described in [Zhang and Zhang 99] is very much cache-aware, and must be explicitly structured to take advantage of cache levels and TLBs. It seems that all the recursively-generated swap schemes generate indices to swap from the outside in; that is, the leaves of the recursion have all the bits fixed except for the middle ones. If we wish to enhance locality, we should instead fix the middle bits first and, at the leaves, let the outermost (i.e. most significant, but also least significant) bits vary. This way, we implicitly get blocking for any cache line size (given sufficient associativity), but also for fully-associative TLBs.

Another way to see this is that each swap tends to be between vastly different indices (bit reversal is bad for locality). One classic way to deal with this is to sort the swaps in Z-order, by interleaving bits from each swapped index. Unfortunately, with bit reversal, this is equivalent to recursing from the outside in. Instead, we can sort on the least significant half of the indices in Z-order, which makes the outermost bits vary between adjacent swaps.
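One way to make the ordering concrete, in Python. The Z-order key on the low halves of both swapped indices is my reading of the idea above, not Napa-FFT3's actual swap schedule; whatever the order, the swaps still compose to the same permutation:

```python
def interleave(a, b, bits):
    """Morton (Z-order) interleave of two bits-wide integers."""
    out = 0
    for k in range(bits):
        out |= ((a >> k) & 1) << (2 * k)
        out |= ((b >> k) & 1) << (2 * k + 1)
    return out

def zorder_swaps(n):
    """All (i, reverse(i)) swap pairs for a bit reversal of length n,
    sorted by the Z-order key of the least-significant halves of the
    two indices, so adjacent swaps touch nearby cache lines."""
    bits = n.bit_length() - 1
    half = bits // 2
    mask = (1 << half) - 1
    pairs = []
    for i in range(n):
        j = int(format(i, '0%db' % bits)[::-1], 2)
        if i < j:
            pairs.append((i, j))
    return sorted(pairs,
                  key=lambda p: interleave(p[0] & mask, p[1] & mask, half))
```

Executing the sorted swaps yields exactly the same bit-reversed result as the naive order; only the memory access pattern changes.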

Obviously, determining this ordering at runtime would be a lot of work. Instead, specialised leaf routines that handle permuting, e.g., the three topmost and bottommost bits (in the right traversal order) are called with the middle bits taken from a pre-sorted vector.

On my workstation, this results in 0.3 cache misses per element (for complex double floats), only a bit more than the compulsory 0.25 misses per element. Surprisingly, while there are a lot of TLB misses (at the first level, I suppose), eliminating them by switching to huge pages doesn't really improve runtimes. I currently think that's because the traversal order is effectively tuned for the last-level TLB, which is fully associative. In the end, the net effect is that bit reversal of large vectors hits around 60% of my workstation's out-of-cache streaming bandwidth (as measured by STREAM's copy loop). It's not instantaneous, but not an insurmountable handicap either.