DSP sleuthing

Published: Jan 14, 2016 by luxagen

This post is the first in an occasional series in which I analyse the performance of DSP code in the wild to identify common quality problems.

Lately I’ve been polishing up a new product for offline, mastering-quality audio-format conversion. It’s more configurable than anything else out there, uses full 64-bit floating-point precision, performs correct rounding and dithering, and is very fast.

While testing, I did some comparisons with Adobe Audition’s resampler, monitoring via its frequency analyser, and found something interesting. Since Audition doesn’t support double-precision floating-point audio, and has numerical problems generating tones with periods of a whole number of samples, let’s have it generate a -6 dB single-precision tone at 1,234 Hz:

figure 1

So far so good — the error floor is still quite uneven thanks to Audition’s failure to generate at higher accuracy and dither to single precision, but it’s good enough for our purposes. Let’s try running an FFT filter on the signal, in this case a narrow bandpass around the tone:

figure 2

That actually made things worse: inharmonic distortion products are visible in the resulting signal, likely because Audition’s FFT-filtering code works directly in single precision instead of using extra accuracy and dithering back to single precision. Other Audition users might point out that there’s an option called “dither transform results” in the settings, but it seems to pertain to processing 8- or 16-bit integer data at 32-bit floating-point intermediate precision rather than improving the accuracy of processing for 32-bit floating-point data.

Now let’s construct a stereo test signal by putting the original one on the left channel and a string of pure zeroes on the right. As expected, this shows only a blue trace in Audition’s frequency analyser because the right channel, being pure zeroes, will have a level of negative-infinity decibels at all frequencies.

What happens if we use the same FFT filter on this stereo test signal?

figure 3

Whoa — where did all that grit on the right (red) channel come from? Filtering is nothing more than multiplication and addition, so zeroes should beget more zeroes.

More puzzling, there’s clear crosstalk — you can see a 1.2 kHz peak on the right channel — yet the peaks of the red error floor match those on the left channel (blue). Even if the main signal is leaking slightly between channels (how does that even happen in a purely digital system?), why are the right channel’s errors at the same level as those of the left despite the much weaker signal? Floating-point representations tend to limit relative error, so seeing what looks like distortion that close in level to a signal is very strange.

In case you want to work it out for yourself, here’s a clue: the tone signal is at -6 dB, and its leakage onto the right channel lies at -148 dB; subtract those and you get a signal-to-crosstalk ratio of 142 dB.

Irreducible complexity

The Fourier Transform, whether discrete, continuous, or some mixture of the two, is best defined in terms of arithmetic on complex numbers, the nature (and beauty) of which we will ruthlessly gloss over here because it deserves a book of its own. Suffice it to say that:

  • complex numbers occupy a two-dimensional plane rather than the “real” number line we’re used to;
  • each complex number can be represented using two real numbers that give its co-ordinates on that plane;
  • complex arithmetic corresponds to a set of simple geometrical operations in that plane.

The second point suggests an obvious representation in software: just use a pair of floating-point values. On their own, floating-point numbers have a degree of error proportional to their magnitude (i.e. their relative error is bounded), but this changes when they’re used for geometry because the co-ordinate values “mix” under rotation, spreading the largest absolute error to all co-ordinates. In general, then, complex arithmetic pollutes the real and imaginary parts of a number with similar-size errors regardless of their relative magnitude.
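To make that concrete, here's a small sketch using NumPy's `complex64` scalars as a stand-in for single-precision complex arithmetic (the angles are arbitrary; any rotations composing to the identity would do). A number that starts out purely real picks up a spurious imaginary component whose size is set by the *real* part's magnitude, not the imaginary part's:

```python
import numpy as np

def rot(theta):
    # A single-precision rotation factor, like an FFT twiddle factor.
    return np.complex64(np.cos(theta) + 1j * np.sin(theta))

# A purely real unit value: its imaginary part is exactly zero.
z = np.complex64(1.0 + 0.0j)

# Three rotations that compose to the identity (0.3 + 0.4 - 0.7 = 0).
z = z * rot(0.3)
z = z * rot(0.4)
z = z * rot(-0.7)

# The real part is still ~1, but the imaginary part now carries
# rounding error proportional to the real part's magnitude.
print(z.imag)  # tiny but nonzero: error at the real part's scale
```

If the imaginary slot had been holding a much weaker second signal, that signal would now be polluted at the stronger signal's error floor.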

Why is all this relevant?

Keeping it real

Signal-processing engineers, most of the time, deal with real signals, the good old one-dimensional kind that we can easily grasp, graph, and use to represent air pressure or electric voltage/current. As already stated, though, the Fourier Transform proper works only in the space of complex numbers; if we allow only real numbers, we restrict ourselves to mirror-symmetrical signals and end up with the corresponding Cosine Transform instead.

The naïve solution

Since real numbers are a subset of complex numbers, any calculation based on real numbers can also be done with complex numbers: just set the imaginary parts of the input values to zero and discard the (ideally zero) imaginary parts of the outputs. The results will be as if you’d used real arithmetic all along.

This suggests a way of using the FFT for real signals — just make the input “complex” by adding zeroed imaginary parts, and discard the redundant imaginary parts of the output after transforming back to the time domain.
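A quick NumPy sketch of that naïve round trip (purely illustrative, not anyone's production code):

```python
import numpy as np

# A real test signal: a tone sitting exactly on bin 13 of a 1024-sample block.
n = 1024
x = np.sin(2 * np.pi * 13 * np.arange(n) / n)

# Promote to complex (imaginary parts zeroed), round-trip through the
# full complex FFT, then discard the near-zero imaginary residue.
X = np.fft.fft(x.astype(complex))
y = np.fft.ifft(X)

print(np.max(np.abs(y.imag)))  # residual imaginary parts: ~1e-16
x_back = y.real
```

Half of every buffer is zeroes on the way in and numerical noise on the way out.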

Doubling up

Having to dilute our input data with an equal number of zeroes and throw away half the output is a ridiculous waste, though. A better approach is to pack an extra signal into the unused imaginary parts instead of zeroing them. The Fourier Transform’s linearity guarantees that, as long as we process the frequency-domain data with the correct symmetry, the two will remain separate after the inverse transform.
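Here's a sketch of the packing trick with NumPy's complex FFT. The index manipulation builds the sequence Z[(N−k) mod N], whose conjugate lets us split the combined spectrum back into its two halves:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)   # "left channel"
y = rng.standard_normal(n)   # "right channel"

# Pack both real signals into one complex block and transform once.
Z = np.fft.fft(x + 1j * y)

# Element k of Zr is conj(Z[(n - k) % n]).
Zr = np.conj(np.roll(Z[::-1], 1))

# Linearity plus conjugate symmetry separates the two spectra:
X = (Z + Zr) / 2     # spectrum of x (conjugate-even part)
Y = (Z - Zr) / 2j    # spectrum of y (conjugate-odd part, divided by i)

# Check against transforming each channel separately.
print(np.allclose(X, np.fft.fft(x)), np.allclose(Y, np.fft.fft(y)))
```

One transform now serves two real signals at full block length.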

The last mile

These days, any FFT library worth its salt will give you specialised “half-complex” versions of the transform and its inverse, designed to go from real-valued time signals to complex frequency spectra and back. Not only is there no need to pad the input signal with zeroes, but the redundant half of the frequency-domain representation is eliminated too. This might seem like nothing more than a nicety to save the programmer the effort of pairing up data blocks, but it actually outperforms the pairwise case by eliminating some work from the transform. The FFTW library follows this logic to its conclusion by including only these half-complex transforms in its precompiled code, and building complex FFTs out of them when necessary.
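NumPy's `rfft`/`irfft` pair is one example of such half-complex transforms; note the spectrum has only n/2 + 1 bins, the rest being the redundant conjugate mirror:

```python
import numpy as np

n = 1024
x = np.sin(2 * np.pi * 13 * np.arange(n) / n)

# Real-to-complex transform: n//2 + 1 bins instead of n.
X = np.fft.rfft(x)
print(X.shape)  # (513,)

# The inverse goes straight back to a real signal: nothing to pad
# on the way in, nothing to discard on the way out.
x_back = np.fft.irfft(X, n)
print(np.allclose(x_back, x))  # True
```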

The big reveal

Audition’s core code was written in the dark ages of the mid ’90s, before ubiquitous Internet access, Wikipedia, decent search engines, and widespread code-sharing; if any general-purpose FFT libraries existed with MIT-style licences appropriate for closed-source commercial software, they would have been a lot harder to find than today, so it makes sense that Audition’s author rolled his own FFT function.

Note that I said “function”, singular: both the continuous and discrete Fourier transforms enjoy the property of being almost self-inverse in their fully-complex forms; as long as you don’t mind suboptimal performance, and with very little extra code, you can use a single function for both directions.
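For instance, with NumPy's forward FFT standing in for that single function, the inverse transform is just two conjugations and a scale factor:

```python
import numpy as np

def inverse_via_forward(X):
    # Inverse DFT built from the forward transform alone:
    # ifft(X) = conj(fft(conj(X))) / N
    return np.conj(np.fft.fft(np.conj(X))) / len(X)

rng = np.random.default_rng(1)
x = rng.standard_normal(64) + 1j * rng.standard_normal(64)
X = np.fft.fft(x)
print(np.allclose(inverse_via_forward(X), x))  # True
```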

This is where the right channel’s crosstalk and distortion come from: when processing a stereo signal, Audition is packing both channels into a single complex data block and processing it with a complex FFT. Thanks to the aforementioned nature of absolute error in floating-point geometry, rounding during complex multiplication bleeds a tiny part of the left-channel signal into the right, and vice versa, while adding a new layer of cumulative error to both with a magnitude proportional to that of the stronger signal — the left one.

Remember that 142 dB signal-to-crosstalk ratio I mentioned earlier? Single-precision IEEE 754 floating point has 23 explicit significand bits and an implicit leading one; taking the sign bit into account gives us a total of 25 significant bits. Assuming that the mean significand magnitude is 1.5, the mean signal-to-error ratio should therefore be 20*log10(1.5 / 2^-25), which is 154.0 dB.
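The arithmetic, for the record:

```python
import math

# 23 explicit significand bits plus the implicit leading one, plus the
# sign bit by this article's counting, gives 25 bits; with a mean
# significand magnitude of 1.5, the expected signal-to-error ratio is:
ser_db = 20 * math.log10(1.5 / 2 ** -25)
print(round(ser_db, 1))  # 154.0
```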

The two figures are suspiciously close, only 12 dB apart. At single precision, and with all the intermediate arithmetic in a large FFT, we’d expect more like a 30 dB accuracy loss, whereas if Audition were filtering at double precision internally, we wouldn’t expect to see anything at all on the right channel: despite the “mixing” effects of complex arithmetic, the crosstalk would be at too low a level (around -300 dB) to survive conversion back to single precision.

It seems fairly conclusive, then, that Audition is performing its FFT filtering directly in single precision. In stereo, it appears to use a single complex-to-complex implementation, with the magic of pairwise summation heavily suppressing the accumulation of rounding errors. It’s possible that a similar trick, packing consecutive pairs of blocks from the same signal into a single FFT block, is used to efficiently process mono signals, but that’s a lot harder to prove with a frequency analyser; far easier would be to compare run time between equivalent mono and stereo signals.