Fft Calculating Incorrectly - Swift

AForge FFT gives a different result than Accelerate FFT

After the following subroutine

vDSP_ctoz((DSPComplex *)realData, 2, complexData, 1,samplesOver2);

is executed, complexData has samplesOver2 elements. But soon after that, you call

vDSP_zvabs(complexData, 1, spectrumData, 1, samples);

which expects complexData to have samples elements, i.e. twice as many. This cannot be.

Also, how is realData laid out? I ask because vDSP_ctoz expects its first argument to be laid out in the form

real0, imag0, real1, imag1, ... real(n-1), imag(n-1).

If your data is indeed real, then imag0, imag1, ... imag(n-1) should all be 0. If it is not, then vDSP_ctoz may not be expecting that. (Unless you are packing the real data in some clever way, which would be two [sic] clever by half!)

Finally, vDSP_create_fftsetup( 16, 2); should probably be changed to

vDSP_create_fftsetup(16, 0);

===================================================================

My sample code appended in postscript:

  FFTSetup fftSetup = vDSP_create_fftsetup(
                                           16,         // vDSP_Length __vDSP_log2n,
                                           kFFTRadix2  // FFTRadix __vDSP_radix
                                           // CAUTION: kFFTRadix2 is an enum that is equal to 0
                                           //          kFFTRadix5 is an enum that is equal to 2
                                           // DO NOT USE 2 IF YOU MEAN kFFTRadix2
                                           );
  NSAssert(fftSetup != NULL, @"vDSP_create_fftsetup() failed to allocate storage");

  int numSamples = 65536;  // numSamples must be an integer power of 2; in this case 65536 = 2 ^ 16
  float realData[numSamples];

  // Prepare the real data with (ahem) fake data, in this case
  // the sum of 3 sinusoidal waves representing a C major chord.
  // The fake data is rigged to have a sampling frequency of 44100 Hz (as for a CD).
  // As always, the Nyquist frequency is just half the sampling frequency, i.e., 22050 Hz.
  for (int i = 0; i < numSamples; i++)
  {
    realData[i] = sin(2 * M_PI * 261.76300048828125 * i / 44100.0)  // C4 = 261.626 Hz
                + sin(2 * M_PI * 329.72717285156250 * i / 44100.0)  // E4 = 329.628 Hz
                + sin(2 * M_PI * 392.30804443359375 * i / 44100.0); // G4 = 391.995 Hz
  }

  float splitReal[numSamples / 2];
  float splitImag[numSamples / 2];

  DSPSplitComplex splitComplex;
  splitComplex.realp = splitReal;
  splitComplex.imagp = splitImag;

  vDSP_ctoz(
            (const DSPComplex *)realData,  // const DSPComplex __vDSP_C[],
            2,                             // vDSP_Stride __vDSP_strideC,  MUST BE A MULTIPLE OF 2
            &splitComplex,                 // DSPSplitComplex *__vDSP_Z,
            1,                             // vDSP_Stride __vDSP_strideZ,
            (numSamples / 2)               // vDSP_Length __vDSP_size
            );

  vDSP_fft_zrip(
                fftSetup,                               // FFTSetup __vDSP_setup,
                &splitComplex,                          // DSPSplitComplex *__vDSP_ioData,
                1,                                      // vDSP_Stride __vDSP_stride,
                (vDSP_Length)lround(log2(numSamples)),  // vDSP_Length __vDSP_log2n,
                // IMPORTANT: THE PRECEDING ARGUMENT MUST BE LOG_BASE_2 OF THE NUMBER OF floats IN splitComplex
                // FOR OUR EXAMPLE, THIS WOULD BE (numSamples / 2) + (numSamples / 2) = numSamples
                kFFTDirection_Forward                   // FFTDirection __vDSP_direction
                );

  printf("DC component = %f\n", splitComplex.realp[0]);
  printf("Nyquist component = %f\n\n", splitComplex.imagp[0]);

  // Next, we compute the Power Spectral Density (PSD) from the FFT.
  // (The PSD is just the magnitude-squared of the FFT.)
  // (We don't bother with scaling as we are only interested in relative values of the PSD.)
  float powerSpectralDensity[(numSamples / 2) + 1];  // the "+ 1" is to make room for the Nyquist component

  // We move the Nyquist component out of splitComplex.imagp[0] and place it
  // at the end of the array powerSpectralDensity, squaring it as we go:
  powerSpectralDensity[numSamples / 2] = splitComplex.imagp[0] * splitComplex.imagp[0];

  // We can now zero out splitComplex.imagp[0] since the imaginary part of the DC component is, in fact, zero:
  splitComplex.imagp[0] = 0.0;

  // Finally, we compute the squares of the magnitudes of the elements of the FFT:
  vDSP_zvmags(
              &splitComplex,         // DSPSplitComplex *__vDSP_A,
              1,                     // vDSP_Stride __vDSP_I,
              powerSpectralDensity,  // float *__vDSP_C,
              1,                     // vDSP_Stride __vDSP_K,
              (numSamples / 2)       // vDSP_Length __vDSP_N
              );

  // We print out a table of the PSD as a function of frequency
  // Replace the "< 600" in the for-loop below with "<= (numSamples / 2)" if you want
  // the entire spectrum up to and including the Nyquist frequency:
  printf("Frequency_in_Hz    Power_Spectral_Density\n");
  for (int i = 0; i < 600; i++)  
  {
    printf("%f,          %f\n", (i / (float)(numSamples / 2)) * 22050.0, powerSpectralDensity[i]);
    // Recall that the array index i = 0 corresponds to zero frequency
    // and that i = (numSamples / 2) corresponds to the Nyquist frequency of 22050 Hz.
    // Frequency values intermediate between these two limits are scaled proportionally (linearly).
  }

  // The output PSD should be zero everywhere except at the three frequencies
  // corresponding to the C major triad.  It should be something like this:

/***************************************************************************
 DC component = -0.000000
 Nyquist component = -0.000000

 Frequency_in_Hz    Power_Spectral_Density
 0.000000,          0.000000
 0.672913,          0.000000
 1.345825,          0.000000
 2.018738,          0.000000
 2.691650,          0.000000
 .
 .
 .
 260.417175,          0.000000
 261.090088,          0.000000
 261.763000,          4294967296.000000
 262.435913,          0.000000
 263.108826,          0.000000
 .
 .
 .
 328.381348,          0.000000
 329.054260,          0.000000
 329.727173,          4294967296.000000
 330.400085,          0.000000
 331.072998,          0.000000
 .
 .
 .
 390.962219,          0.000000
 391.635132,          0.000000
 392.308044,          4294966784.000000
 392.980957,          0.000000
 393.653870,          0.000000
 .
 .
 .
***************************************************************************/

  vDSP_destroy_fftsetup(fftSetup);

Swift FFT - Complex split issue

There were a couple of problems with your code:

you weren't reading in the audio file samples
channelSamples was packed incorrectly
vDSP_fft_zrip was reading beyond the end of the array. it expects 2^log2n samples
vDSP_fft_zrip's output is packed and your calculations expect unpacked

Swift 4 version now with actual fix for point 3

let fileURL = Bundle.main.url(forResource: "foo", withExtension: "mp3")!
let audioFile = try!  AVAudioFile(forReading: fileURL as URL)
let frameCount = UInt32(audioFile.length)

let log2n = UInt(round(log2(Double(frameCount))))
let bufferSizePOT = Int(1 << log2n)

let buffer = AVAudioPCMBuffer(pcmFormat: audioFile.processingFormat, frameCapacity: AVAudioFrameCount(bufferSizePOT))!
try! audioFile.read(into: buffer, frameCount:frameCount)

// Not sure if AVAudioPCMBuffer zero initialises extra frames, so when in doubt...
let leftFrames = buffer.floatChannelData![0]
for i in Int(frameCount)..<Int(bufferSizePOT) {
    leftFrames[i] = 0
}

// Set up the transform
let fftSetup = vDSP_create_fftsetup(log2n, Int32(kFFTRadix2))!

// create packed real input
var realp = [Float](repeating: 0, count: bufferSizePOT/2)
var imagp = [Float](repeating: 0, count: bufferSizePOT/2)
var output = DSPSplitComplex(realp: &realp, imagp: &imagp)

leftFrames.withMemoryRebound(to: DSPComplex.self, capacity: bufferSizePOT / 2) {
    vDSP_ctoz($0, 2, &output, 1, UInt(bufferSizePOT / 2))
}

// Do the fast Fourier forward transform, packed input to packed output
vDSP_fft_zrip(fftSetup, &output, 1, log2n, Int32(FFT_FORWARD))

// you can calculate magnitude squared here, with care
// as the first result is wrong! read up on packed formats
var fft = [Float](repeating:0.0, count:Int(bufferSizePOT / 2))
vDSP_zvmags(&output, 1, &fft, 1, vDSP_Length(bufferSizePOT / 2))

// Release the setup
vDSP_destroy_fftsetup(fftSetup)

Using the Apple FFT and Accelerate Framework

I just got the FFT code working for an iPhone project:

create a new project
delete all the files except for main.m and xxx_info.plist
going to project settings and search for pch and stop it from trying to load a .pch (seeing as we have just deleted it)
copy paste the code example over whatever you have in main.m
remove the line that #include's Carbon. Carbon is for OSX.
delete all the frameworks, and add accelerate framework

You might also need to remove an entry from info.plist that tells the project to load a xib, but I'm 90% sure you don't need to bother with that.

NOTE: Program outputs to console, results come out as 0.000 that's not an error –- it's just very very fast

This code is really stupidly obscure; it is generously commented, but the comments don't actually make life any easier.

Basically at the heart of it is:

vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_INVERSE);

FFT on n real floats, and then reverse to get back to where we started.
ip stands for in-place, which means &A gets overwritten
That's the reason for all this special packing malarkey -- so that we can squash the return value into the same space as the send value.

To give some perspective (like, as in: why would we be using this function in the first place?), Let's say we want to perform pitch detection on microphone input, and we have set it up so that some callback gets triggered every time the microphone gets in 1024 floats. Supposing the microphone sampling rate was 44.1kHz, so that's ~44 frames / sec.

So, our time-window is whatever the time duration of 1024 samples is, ie 1/44 s.

So we would pack A with 1024 floats from the mic, set log2n=10 (2^10=1024), precalculate some bobbins (setupReal) and:

vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);

Now A will contain n/2 complex numbers. These represent n/2 frequency bins:

bin[1].idealFreq = 44Hz -- ie The lowest frequency we can reliably detect is ONE complete wave within that window, ie a 44Hz wave.
bin[2].idealFreq = 2 * 44Hz
etc.
bin[512].idealFreq = 512 * 44Hz -- The highest frequency we can detect (known as the Nyquist frequency) is where every pair of points represents a wave, ie 512 complete waves within the window, ie 512 * 44Hz, or: n/2 * bin[1].idealFreq
Actually there is an extra Bin, Bin[0] which is often referred to as 'DC Offset'. It so happens that Bin[0] and Bin[n/2] will always have complex component 0, so A[0].realp is used to store Bin[0] and A[0].imagp is used to store Bin[n/2]

And the magnitude of each complex number is the amount of energy vibrating around that frequency.

So, as you can see, it wouldn't be a very great pitch detector as it doesn't have nearly fine enough granularity. There is a cunning trick Extracting precise frequencies from FFT Bins using phase change between frames to get the precise frequency for a given bin.

Ok, Now onto the code:

Note the 'ip' in vDSP_fft_zrip, = 'in place' ie output overwrites A ('r' means it takes real inputs)

Look at the documentation on vDSP_fft_zrip,

Real data is stored in split complex
form, with odd reals stored on the
imaginary side of the split complex
form and even reals in stored on the
real side.

this is probably the hardest thing to understand. We are using the same container (&A) all the way through the process. so in the beginning we want to fill it with n real numbers. after the FFT it is going to be holding n/2 complex numbers. we then throw that into the inverse transform, and hopefully get out our original n real numbers.

now the structure of A its setup for complex values. So vDSP needs to standardise how to pack real numbers into it.

so first we generate n real numbers: 1, 2, ..., n

for (i = 0; i < n; i++)
    originalReal[i] = (float) (i + 1);

Next we pack them into A as n/2 complex #s:

// 1. masquerades n real #s as n/2 complex #s = {1+2i, 3+4i, ...}
// 2. splits to 
//   A.realP = {1,3,...} (n/2 elts)
//   A.compP = {2,4,...} (n/2 elts)
//
vDSP_ctoz(
          (COMPLEX *) originalReal, 
          2,                            // stride 2, as each complex # is 2 floats
          &A, 
          1,                            // stride 1 in A.realP & .compP
          nOver2);                      // n/2 elts

You would really need to look at how A is allocated to get this, maybe look up COMPLEX_SPLIT in the documentation.

A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));

Next we do a pre-calculation.

Quick DSP class for maths bods:
Fourier theory takes a long time to get your head around (I've been looking at it on and off for several years now)

A cisoid is:

z = exp(i.theta) = cos(theta) + i.sin(theta)

i.e. a point on the unit circle in the complex plane.

When you multiply complex numbers, the angles add. So z^k will keep hopping around the unit circle; z^k can be found at an angle k.theta

Choose z1 = 0+1i, i.e. a quarter turn from the real axis, and notice that z1^2 z1^3 z1^4 each give another quarter turn so that z1^4 = 1
Choose z2 = -1, i.e. a half-turn. also z2^4 = 1 but z2 has completed 2 cycles at this point (z2^2 is also = 1). So you could think of z1 as the fundamental frequency and z2 as the first harmonic
Similarly z3 = the 'three-quarters of a revolution' point i.e. -i completes exactly 3 cycles, but actually going forwards 3/4 each time is the same as going backwards 1/4 each time

i.e. z3 is just z1 but in the opposite direction -- It's called aliasing

z2 is the highest meaningful frequency, as we chose 4 samples to hold a full wave.

z0 = 1+0i, z0^(anything)=1, this is DC offset

You can express any 4-point signal as a linear combination of z0 z1 and z2
i.e. You're projecting it onto these basis vectors

but I hear you asking "what does it mean to project a signal onto a cisoid?"

You can think of it this way: The needle spins round the cisoid, so at sample k, the needle is pointing in direction k.theta, and the length is signal[k]. A signal that matches the frequency of the cisoid exactly will bulge the resulting shape in some direction. So if you add up all the contributions, you'll get a strong resultant vector.
If the frequency nearly matches, than the bulge will be smaller and will move slowly around the circle.
For a signal that doesn't match frequency, the contributions will cancel one another out.

http://complextoreal.com/tutorials/tutorial-4-fourier-analysis-made-easy-part-1/ will help you get an intuitive understanding.

But the gist is; if we have chosen to project 1024 samples onto {z0,...,z512} we would have precalculate z0 thru z512, and that's what this precalculation step is.

Note that if you are doing this in real code you probably want to do this once when the app loads and call the complementary release function once when it quits. DON'T do it lots of times -- it is expensive.

// let's say log2n = 8, so n=2^8=256 samples, or 'harmonics' or 'terms'
// if we pre-calculate the 256th roots of unity (of which there are 256) 
// that will save us time later.
//
// Note that this call creates an array which will need to be released 
// later to avoid leaking
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);

It's worth noting that if we set log2n to eg 8, you can throw these precalculated values into any fft function that uses resolution <= 2^8. So (unless you want ultimate memory optimisation) just create one set for the highest resolution you're going to need, and use it for everything.

Now the actual transforms, making use of the stuff we just precalculated:

vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);

At this point A will be containing n/2 complex numbers, only the first one is actually two real numbers (DC offset, Nyquist #) masquerading as a complex number. The documentation overview explains this packing. It is quite neat -- basically it allows the (complex) results of the transform to be packed into the same memory footprint as the (real, but weirdly packaged) inputs.

vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_INVERSE);

and back again... we will still need to unpack our original array from A. then we compare just to check that we have got back exactly what we started out with, release our precalculated bobbins and done!

But wait! before you unpack, there is one final thing that needs to be done:

// Need to see the documentation for this one...
// in order to optimise, different routines return values 
// that need to be scaled by different amounts in order to 
// be correct as per the math
// In this case...
scale = (float) 1.0 / (2 * n);

vDSP_vsmul(A.realp, 1, &scale, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &scale, A.imagp, 1, nOver2);

FFT on iPhone to ignore background noise and find lower pitches

Pitch is not the same as peak magnitude frequency bin (which is what the FFT in the Accelerate framework might give you directly). So any peak frequency detector will not be reliable for pitch estimation. A low-pass filter will not help when the note has a missing or very weak fundamental (common in some voice, piano and guitar sounds) and/or lots of powerful overtones in its spectrum.

Look at a wide-band spectrum or spectrograph of your musical sounds and you will see the problem.

Other methods are usually needed for a more reliable estimate of musical pitch. Some of these include autocorrelation methods (AMDF, ASDF), Cepstrum/Cepstral analysis, harmonic product spectrum, phase vocoder, and/or composite algorithms such as RAPT (Robust Algorithm for Pitch Tracking) and YAAPT. An FFT is useful as only a sub-part of some of the above methods.

How to calculate the energy per bin in a DFT?

Your outline above looks on point ... to calculate the magnitude for a given bin

mag = 2.0 * math.Sqrt(real*real+imag*imag) / number_of_samples

where number_of_samples is the length of array fed into the fft call ... beautiful aspect of doing a fft is you can then apply the inverse Fourier Transform on that set of ( freq, magnitude, phase shift) to return back your source time domain signal ... doing this is a nice way to validate your process is correct

Magic of Fourier Transform and the Inverse Fourier Transform - an example :

you start with a floating point array which represents something which wobbles like audio, stock market index or any time-series ... this is the time domain representation since its a set of points on a curve where time is your left to right X axis and the up and down Y axis is the height of the curve ... then you feed this array into a fft api call which will return back to you the same information in its frequency domain representation ... its the same information just in a different representation ... in the freq domain you will have an array where element 0 is always frequency of 0 cycles per second (DC offset) then as you iterate across the array you increment freq using formula

incr_freq := sample_rate / number_of_samples

so in the complex array generated by the fft call each element is the data for a given frequency where each element is simply a complex number ... in simple terms this freq domain representation is just a set of frequencies, each freq embodied by a complex number (A + Bi) which can be used to calc that freq's magnitude and phase shift

now is the interesting part ... if you send this freq domain array into an inverse Fourier Transform you will get back your original data which is in time domain representation

myAudio_TD ( time domain ) --> send to fft --> myAudio_FD ( freq domain )

then later you are free to do the reverse as in

myAudio_FD ( freq domain ) --> send to inverse fft --> myAudio_TD ( time domain )

notice in that progression you started with an array myAudio_TD which got sent into an fft call then into an inverse fft call which magically returns back to you your original myAudio_TD

to see the full parsing of the complex array returned from fft call which includes notion of Nyquist Limit see Get frequency with highest amplitude from FFT

how to correctly pad 2D array for FFT with iOS Accelerate framework

You are likely doing the zero-padding correctly. Any expectation that the results will be the same after zero-padding is incorrect. All the FFT bin frequencies will instead be related to the larger sized array.

Need help understanding this line in an FFT algorithm

But there's one thing I don't get: this is a length-N DFT, and yet only N/2 twiddle factors are ever calculated. Shouldn't the array be of length N, and the modulo should be by N?

The twiddle factors are equally spaced points on the unit circle, and there is an even number of points because N is a power-of-two. After going around half of the circle (starting at 1, going counter clockwise above the X-axis), the second half is a repeat of the first half but this time it's below the X-axis (the points can be reflected through the origin). That is why Temp is subtracted the second time. That subtraction is the negation of the twiddle factor.

FFT and decibel scales

The simple answer is that, for the most part, you can add an arbitrary number to the dB value to make the values all positive, or all negative, or whatever you prefer. With an uncalibrated microphone, like on the iPhone, this is all that makes sense anyway, since all you know are relative values.

For a more advanced technical approach, using a calibrated microphone, you could reference everything using dB (SPL), as a reasonable standard, but this is a hassle, and not meaningful in your use case anyway.

Rationale:

The main reason that shifting by an arbitrary amount is that the log doesn't report the units of measurement. For example, even if you know the input amplitude is 0.1 Pascal, it's completely valid to say this is 100 milliPascal, where you'd be taking the log of 100 rather than 0.1 (so the log values are either 2 or -1). Both are completely valid and the choice is entirely arbitrary. When comparing to a standard reference, as in dB SPL, note that it's done as a ratio, log(P/P_ref), removing the impact of changing the units.