Adventures in modern restoration

Published: Jan 7, 2016 by luxagen

Back in the late ’90s, CDs were still relatively expensive and back catalogues much less complete than today. Meanwhile, I was getting into digital audio on my PC via Cool Edit, so I did the occasional tape/vinyl restoration project. The noise-reduction tools available at the consumer-software level weren’t that great – as far as I can tell, they still aren’t – and at first I wasn’t too skilled at using them either, so I muddled along as best I could, and it gave me a chance to refine the listening and equalising skills I’d first learned trying to make crappy £12 earbuds sound tonally neutral.

When I talk about “noise reduction” or NR in this article, I’m mostly referring to tools designed to profile and attenuate static rumble and hiss; this kind of processing remains the cornerstone (along with equalisation) of most remastering efforts.

What did I learn from restoration? The recurring theme is that it’s an engineering tradeoff between removing flaws and doing further damage to the signal you care about.

Speech

Nowhere is this tradeoff clearer than in spoken-word material: the human auditory system is so incredibly optimised for speech comprehension that the assumption that you need to use noise-reduction processing at all is often wrong. In fact, it’s very difficult to noise-reduce a hiss-filled speech recording without actively harming its intelligibility, as all kinds of low-amplitude details like sibilance and glottal stops tend to suffer.

If you want to produce a transcript from a noisy speech recording, the best approach is very often to give up on noise reduction altogether, do some equalisation, and listen through the noise – having highly-trained ears helps a lot too. This applies far more to hiss than tonal noise, of course – if a deafening ground-loop hum is drowning everything else out, mitigating it with NR or notch filtering will help a lot – but leave the broadband hiss well alone.
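As a rough illustration of the notch-filtering option for hum, here is a minimal sketch. It assumes 50 Hz mains hum (60 Hz in some countries) and uses the widely published "Audio EQ Cookbook" biquad notch formulas; the function names and the Q value are arbitrary choices for the example, not settings taken from any particular tool.

```python
import math

def notch_coefficients(hum_hz, sample_rate, q=10.0):
    """Biquad notch coefficients (Audio EQ Cookbook form), normalised so a[0] == 1."""
    w0 = 2.0 * math.pi * hum_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a

def biquad(samples, b, a):
    """Direct-form-I biquad applied to a sequence of floats."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        out.append(y0)
    return out

def rms(samples):
    """Root-mean-square level of a sequence of floats."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def sine(freq_hz, sample_rate, seconds):
    """Unit-amplitude test tone."""
    count = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate) for n in range(count)]
```

Feeding `sine(50.0, 8000, 2.0)` through the filter should leave almost nothing once the filter has settled, while a 1 kHz tone passes essentially untouched. A higher Q narrows the notch but makes it take longer to settle.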

Music

Before facing the perils of speech, I restored music from vinyl and cassette tape. Music is both much more forgiving of noise-reduction processing than speech, and much more pleasing to hear with a low noise floor.

I started out following the transfer process with the usual noise reduction, click/pop elimination (for vinyl), and equalisation. The first lesson I learned was that equalisation should be done before noise reduction – after all, you can’t properly assess your noise-reduction settings when the high frequencies it will most drastically affect are at so low a level that you can’t even hear them.

The main challenge of equalising treble-challenged analog sources without doing NR first is that one tends to confuse the “bright” sound of the hiss itself with the desired brightness in the signal of interest. Both vinyl and especially cassette tape also require careful bass EQ to remove the boxy-sounding bass boosts that both of these formats add to the sound. As far as I know, practice is the only antidote to these problems.

Another issue that’s obvious in hindsight is that you really hear what’s wrong with your NR settings in quiet intros and end-of-song fades, and if you want to hear the worst of the burbling, you have to listen at very high levels. Don’t get me wrong – I’m generally extremely careful not to expose my hearing to potential sources of damage – but this is one of those situations where it can’t be helped. In those days, the initial steps of my restoration projects would be followed by hours of adjusting the NR settings and listening, in short bursts, to fade tails at full blast to make sure I was minimising audible artefacts.

Transfers

The first couple of restoration projects were transferred at 44.1 kHz and 16 bits, but I quickly learned that the consumer-level A/D convertors of the time (specifically on my Turtle Beach Santa Cruz card) just didn’t image very well at 44.1 kHz – everything was smeared over half the pan spectrum. The difference at 48 kHz was astonishing: it was much closer to the “analog” clarity and presence I could hear on some of my vinyl-derived cassette tapes of the time.

I was a bit more selective about 24-bit recording. In those days, storage space was a severe constraint, both during restoration and when archiving to optical disc, so for audio cassettes I’d often stick to 16-bit transfers; since even a -96 dB noise floor was far, far below that of the tape material, the signal would be self-dithering and I wouldn’t need to worry about the innards of the sound card so much. Meanwhile, keeping the restoration processing in floating point (by converting to FP as the first step of the workflow, in Audition’s case) assured no change in the quality of the results.
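The -96 dB figure follows directly from the word length. The sketch below is a rule-of-thumb calculation only: it gives the ratio between full scale and one quantisation step and ignores dither and the exact spectrum of the quantisation noise. The function name is made up for the example.

```python
import math

def quantisation_range_db(bits):
    """Ratio, in dB, between full scale and one quantisation step for integer PCM."""
    return 20.0 * math.log10(2 ** bits)

for bits in (16, 24):
    print(f"{bits}-bit: about {quantisation_range_db(bits):.1f} dB")
```

Any source whose own noise floor sits well above the 16-bit figure, as cassette tape does, will bury the quantisation steps in its own hiss.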

As time went on, CPU and storage-space constraints became less severe, so I graduated to doing transfers at 96 kHz for two reasons: first, if bumping up to 48 kHz improved the clarity of the signal so much on the affordable convertors of the time, 96 kHz had to be even better, right? Second, restoration tends to involve nonlinear processing, whether noise reduction or simple compression/gating, and it seemed plausible to me that using a higher sample-rate might improve transparency there. I’d still do the transfers at 16-bit resolution where appropriate, but these days I don’t even bother with that; storage is so plentiful that all transfers are 24-bit, even if I still often archive cassette transfers at 16-bit.

The key to transfers is not to be clever in any way. Permanently normalising transfers in PCM is worse than pointless, for a start: it’ll just add more quantisation noise and risk harmonic distortion through improper dithering. Cropping transfers is similarly pointless because it never saves you any significant storage space and destroys material that you might need for restoration, not least periods of signal-free background noise useful for NR profiling.

I did go through a phase of boosting material by an integer factor to help in monitoring during Adobe-Audition-based workflows, because multiplication by a whole number will never generate more numerical precision at the low end of the sample word and is therefore degradation-free, provided nothing clips. When I did this, it was always with a tool where I could give a direct multiple rather than using decibels, because the dB value of any rational gain factor other than an integer power of 10 is irrational – it has an infinitely long decimal expansion and can’t be specified precisely.
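A small sketch of the difference between an exact integer boost and a decibel approximation of it. The sample values and helper names are invented for illustration, and the samples are treated as plain integers with no clipping.

```python
import math

def factor_to_db(factor):
    """Convert a linear gain factor to decibels."""
    return 20.0 * math.log10(factor)

def db_to_factor(db):
    """Convert decibels to a linear gain factor."""
    return 10.0 ** (db / 20.0)

def apply_gain(samples, factor):
    """Scale integer PCM samples, rounding back to integers as a PCM file must."""
    return [round(s * factor) for s in samples]

samples = [-1200, -7, 0, 1, 350, 1000]

# Exact integer boost: every result is a multiple of 3, so dividing recovers the input.
boosted = apply_gain(samples, 3)
restored = [s // 3 for s in boosted]

# "+9.5 dB" only approximates a factor of 3, so the scaled values get rounded.
approx = apply_gain(samples, db_to_factor(9.5))

print(restored == samples)   # the integer boost is reversible
print(approx == boosted)     # the dB approximation gives different samples
print(factor_to_db(3))       # the dB value of x3 has no finite decimal expansion
```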

If you’re using lossless compression for your archives, though, be warned that any baked-in amplitude boosting, even by integer factors, risks spoiling your compression ratio. Although some lossless formats will ignore low-order zeroes, allowing you to safely boost by powers of two, non-power-of-two factors are unlikely to be factored out during encoding because they cause requantisation.
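The "low-order zeroes" idea can be sketched as follows. This is only an illustration of what an encoder could detect, not the logic of any specific format; the function name and sample values are hypothetical.

```python
def shared_trailing_zero_bits(samples):
    """Count the low-order bits that are zero in every sample (0 if all samples are zero)."""
    combined = 0
    for s in samples:
        combined |= abs(s)   # trailing zeroes are the same for s and -s
    if combined == 0:
        return 0
    count = 0
    while combined & 1 == 0:
        combined >>= 1
        count += 1
    return count

samples = [5, -3, 12, 7, -10]
times_four = [s * 4 for s in samples]
times_three = [s * 3 for s in samples]

print(shared_trailing_zero_bits(times_four))   # two bits an encoder could strip
print(shared_trailing_zero_bits(times_three))  # nothing to strip
```

A boost by four leaves two bits that are zero in every sample and can be dropped before encoding. A boost by three inflates every sample without leaving any such pattern.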

In the end, there’s a powerful argument against baking any processing into your archived transfers: you never know when you might want to revisit an old project with better tools. In my case, even baking in EQ was a bad idea because my early equalisation wasn’t very skilled, and I eventually switched from graph-based FFT filtering to parametric EQ anyway.

For all these reasons, I long ago stopped doing anything to transfers between recording and archiving; my transfer process is to record whole sides, starting recording before the record/tape is started, and stopping recording only once the side is over. Once that’s done, I have the transfer – that’s it. REAPER-based nondestructive workflows help with this by allowing me to preview everything live, and I now reserve gain adjustments for the final mastering step where they belong.

Surface noise

A vinyl-specific issue that starkly demonstrates the tradeoff between removing unwanted noise and damaging the signal of interest is the old wet/dry debate. Wet playing is a controversial transfer technique – by which I mean that there’s an endless river of baseless FUD spread by irrational haters – where you layer the surface of the record with a lubricant during playback to mitigate surface noise.

I use weak detergent solution, and I imagine it works through a combination of letting the needle skate over gouges and suspending grit so that the needle doesn’t have to ride over it. The wet-playing technique requires constant supervision and care in maintaining the liquid layer to compensate for evaporation without drowning the needle or overflowing the edge of the turntable. Although it’s possible that wet playing does more damage in the long run, I haven’t yet seen convincing evidence to back up any of the wacky theories on this subject, and my main concern is to get the best possible transfer anyway.

You might expect wet playing to have horrible effects on sound quality, but in reality it only dulls the treble and transients a little and gives the sound a slightly “watery” quality. The benefits in reducing surface noise, on the other hand, are nothing short of miraculous.

In those days my faith in the magic of DSP was a little too great, so for a while I was a purist and stuck to dry playing in combination with digital processing, but I quickly found digital click/pop elimination to be very unsatisfactory: removing any significant amount of crackle would cause severe damage to certain types of signal. Rickie Lee Jones’s debut album illustrated this well: Tom Scott’s horn arrangements attracted far too much attention from Cool Edit’s click/pop eliminator, causing audible distortion. It was then that I decided that, whatever downsides wet playing had, its overall payoff was massive compared to the hamfisted blunderings of digital pop reduction.

These kinds of digital tools have undoubtedly advanced a lot since then, but it still makes sense to me that reliably discriminating valid signal transients from record pops is basically impossible. Vinyl surface noise is just too variable in both level and timbre to usefully profile, and since it’s often intentionally added rather than unwanted (ask any producer), there will always be a need for moment-to-moment supervision of this kind of tooling.

Noise reduction

This line of argument doesn’t apply to rumble and hiss because they tend to have very reliable overall statistics that can be analysed using a reference sample of pure noise. Any added hiss that’s part of the valid signal, and lies at least 12 dB or so above the unwanted noise floor, will be relatively easy for NR software to separate and preserve.

Reducing hiss (static noise) is relatively simple to explain: the basic trick is to divide the audio spectrum into channels by windowing data blocks and applying the Fast Fourier Transform (a class of algorithms that efficiently implement the Discrete Fourier Transform). When blocks are overlapped to compensate for the information loss caused by windowing, you have something known as a DFT filterbank: a set of overlapping, evenly-spaced “channel” filters with identically-shaped frequency responses dictated by the window function, each of which extracts a complex-valued narrowband signal from the overall one.

So far, so technical; the point is that this is a half-decent analogue to the physical mechanics of human hearing as found in the cochlea – our ears also analyse in terms of a large number of narrow channels, which is how we’re able to so finely discriminate tones from each other and the surrounding noise – and it’s completely reversible to get the original signal back: all you have to do is inverse-FFT each block and overlap/mix them.
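The reversibility claim can be checked with a toy filterbank. This is a minimal sketch under stated assumptions: a periodic Hann window with 50% overlap (chosen because the overlapped windows sum to exactly 1), a deliberately naive DFT so the example needs no libraries, and zero-padding at both ends so every sample is covered by two blocks. None of the names correspond to a real tool.

```python
import cmath
import math

def hann(n_points):
    """Periodic Hann window: copies offset by half a block sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points) for n in range(n_points)]

def dft(block):
    """Naive O(N^2) DFT; an FFT computes the same values faster."""
    n_points = len(block)
    return [sum(block[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

def idft(spectrum):
    """Inverse of dft(), keeping the real part of each output sample."""
    n_points = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_points)
                for k in range(n_points)).real / n_points
            for n in range(n_points)]

def analyse(signal, n_points):
    """Split a signal into half-overlapped, Hann-windowed blocks and transform each."""
    hop = n_points // 2   # n_points is assumed to be even
    window = hann(n_points)
    tail = hop + (-len(signal)) % hop
    padded = [0.0] * hop + list(signal) + [0.0] * tail
    starts = range(0, len(padded) - n_points + 1, hop)
    return [dft([padded[s + n] * window[n] for n in range(n_points)]) for s in starts]

def resynthesise(spectra, n_points, length):
    """Inverse-transform each block and overlap-add to rebuild the signal."""
    hop = n_points // 2
    out = [0.0] * (hop * (len(spectra) + 1))
    for index, spectrum in enumerate(spectra):
        for n, value in enumerate(idft(spectrum)):
            out[index * hop + n] += value
    return out[hop:hop + length]

signal = [math.sin(0.3 * n) + 0.1 * n for n in range(40)]
rebuilt = resynthesise(analyse(signal, 16), 16, len(signal))
print(max(abs(x - y) for x, y in zip(signal, rebuilt)))  # rounding error only
```

Any processing, such as gating, would happen on the per-block spectra between `analyse` and `resynthesise`.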

The filterbank representation, whether arrived at through a Fast Fourier Transform or the cochlea’s microelectromechanical systems, is a dream for isolating tones from the rest of a signal because they show up as sharp, strong, static peaks. The major drawback is that filterbank outputs are the worst possible representation for transients, and the cochlea’s short-term responsiveness to changes in the incoming signal is therefore horrible by any sensible measure.

Once the auditory-nerve signals from the cochlea have been decoded, an important stage of the neurological portion of the human auditory system involves reassembling transients and diverting them through separate signal paths for further processing.

This transient separation is the part that tends to be missing from noise-reduction software, mainly because it’s hard and requires a really solid grasp of signal-processing principles. NR software will usually push each channel’s complex signal through a noise-gate processor before reassembly, each gate’s threshold being derived from the corresponding part of the noise profile. As long as transients are loud enough to stand out from background noise at most frequencies, they’ll reassemble correctly in the end, but they tend to easily get swamped by noise for the same mathematical reason that tones stand out from noise, and the effect gets worse the longer the DFT blocks are, murdering low-level transients with extreme prejudice.
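The per-channel gating idea can be sketched like this. It is a deliberately crude illustration: real NR software will typically use gentler gain curves than a hard on/off gate, and the `margin` parameter and function names here are assumptions made for the example.

```python
def noise_profile(noise_spectra):
    """Average magnitude per channel over blocks taken from a signal-free passage."""
    channels = len(noise_spectra[0])
    return [sum(abs(spectrum[k]) for spectrum in noise_spectra) / len(noise_spectra)
            for k in range(channels)]

def gate_block(spectrum, profile, margin=2.0):
    """Zero every channel whose magnitude is below margin times its profiled noise level."""
    return [value if abs(value) >= margin * profile[k] else 0j
            for k, value in enumerate(spectrum)]

# Two made-up noise blocks with three channels each.
profile = noise_profile([[1 + 0j, 0.5j, 2 + 0j],
                         [3 + 0j, -0.5 + 0j, 2j]])

# Only the first channel stands far enough above its noise level to survive.
gated = gate_block([10 + 0j, 0.6 + 0j, 3j], profile)
```

Each channel gets its own threshold from the profile, which is why a reference sample of pure noise matters so much.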

The shorter the DFT blocks you use to analyse the same signal, the more the noise floor will appear to rise and compete with tones. This progressively worsens “burbling” effects as the NR software mistakenly scrubs away more and more of the tonal elements to get rid of the noise.
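The apparent rise in the noise floor can be put into numbers for an idealised case. The sketch below assumes a rectangular window, a sine wave sitting exactly on a channel centre, and white noise; under those assumptions the tone peak grows with the square of the block length while the expected noise power per channel grows only linearly. The function name is hypothetical.

```python
import math

def tone_to_noise_db(tone_amplitude, noise_sigma, n_points):
    """Expected tone-peak power over expected noise power in one DFT channel.

    Assumes a rectangular window, a sine exactly on a channel centre,
    and white noise with standard deviation noise_sigma.
    """
    tone_power = (tone_amplitude * n_points / 2.0) ** 2
    noise_power = n_points * noise_sigma ** 2
    return 10.0 * math.log10(tone_power / noise_power)

for n_points in (512, 1024, 2048, 4096):
    print(n_points, round(tone_to_noise_db(0.01, 0.01, n_points), 1))
```

Each halving of the block length costs about 3 dB of separation between a steady tone and the noise in its channel.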

Although transients are also continuous in frequency, and will therefore spread over a filterbank’s channels in a similar way to noise, they’re much more localised in time. The longer the DFT blocks one uses, the more of the tonal/noisy signal surrounding each transient is transformed with it, making it harder to detect and discriminate.

The upshot of all this is that adjusting NR software is a nasty tradeoff between preserving transients and tones; the degree of static-noise reduction you can get without affecting either too badly will generally be out of your control. As window functions are rarely adjustable in such software, FFT size is critical, and I’ve found FFT blocks between one hundredth and one tenth of a second to be most useful for restoration.

A whole tenth of a second (say 4096 points at 44.1 or 48 kHz) is for desperate situations where you need to polish the tonal signal elements as much as possible at the cost of pretty much obliterating transients. At the other end of the scale, blocks a hundredth of a second long are really quite bad at isolating tones from noise but minimise transient damage.
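For reference, block duration and channel spacing follow directly from the FFT size and sample rate. This is just the arithmetic; the helper names are invented for the example.

```python
def block_seconds(n_points, sample_rate):
    """Duration of one FFT block in seconds."""
    return n_points / sample_rate

def channel_spacing_hz(n_points, sample_rate):
    """Spacing between adjacent filterbank channels in hertz."""
    return sample_rate / n_points

for sample_rate in (44100, 48000):
    for n_points in (512, 1024, 4096):
        print(sample_rate, n_points,
              f"{block_seconds(n_points, sample_rate) * 1000:.1f} ms",
              f"{channel_spacing_hz(n_points, sample_rate):.1f} Hz")
```

The two quantities are reciprocals of each other, which is the tradeoff in a nutshell: finer channel spacing can only be bought with longer blocks.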

In my restoration work from cassette tape, I found 6000 points (~125 milliseconds at 48 kHz) to be the sweet spot – any smaller and there wasn’t enough frequency resolution to scrub out the relatively high noise floor, and much larger would make the whole thing sound too washy by swamping everything with pre-/post-echo.

One of the timeless truths about NR processing is that the higher the original signal-to-noise ratio, the better the results. This explains why remastered CDs sound so good: although professional-grade NR technology is undoubtedly a cut above what we have access to with tools like Audition and Audacity, it’s not magical – it also performs way better on reel-to-reel tape transfers because that medium is the very best of analog audio technology, sporting a signal-to-noise ratio of around 60 dB.

Vinyl isn’t as good as reel-to-reel, but the levels of static noise on a good vinyl pressing are still a fair bit lower than on cassette tape, so vinyl restoration can use slightly shorter FFT blocks without compromising tones too badly.

The final element of the FFT-filterbank downward-expander design is the attack and decay settings. A major cause of “metallic” decays in noise reduction is that tonal elements that fade away, as almost every natural musical sound does, will have their harmonics abruptly disappear, one by one, as each appears to sink below the noise floor. The more that noise floor appears to rise through the use of short blocks, the worse this problem gets.

Attack and decay settings help to mitigate this by smoothing channel amplitudes in successive blocks on the way into the expanders; this smooths out the statistics of static noise a bit, improving separation from tones. It also smooths the starts and ends of tonal signals, rendering the truncation effect less abrupt and offensive. If you have these settings available, you can generally use shorter FFT blocks and make up the difference with longer decay/release timing, but beware of using slow release times as a panacea: if you overdo it, transients will end up with an accompanying “shadow” of background noise that sounds a lot like reverb. This might not matter if you’re working on a track from the ’80s, of course, so in the end all such adjustments are material-dependent to some degree – this is one of the reasons that digital remixes using individually-restored multitracks sound even better than the stereo remasters we see the vast majority of the time.
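The smoothing step can be sketched as a one-pole follower applied to one channel's magnitude from block to block. This is an assumption about how such settings might be implemented rather than a description of any particular product; the coefficients are plain numbers between 0 and 1 instead of times in milliseconds.

```python
def smooth_channel(magnitudes, attack, release):
    """One-pole smoothing of one channel's magnitude across successive blocks.

    attack and release are coefficients in (0, 1]; 1 follows the input
    instantly, smaller values respond more slowly.
    """
    level = 0.0
    smoothed = []
    for magnitude in magnitudes:
        coefficient = attack if magnitude > level else release
        level += coefficient * (magnitude - level)
        smoothed.append(level)
    return smoothed

# Instant attack, slow release: a one-block burst leaves a decaying tail.
print(smooth_channel([0.0, 1.0, 0.0, 0.0], attack=1.0, release=0.5))
# → [0.0, 1.0, 0.5, 0.25]
```

The decaying tail after the burst is what keeps a fading harmonic's gate open a little longer. Stretched too far, it is also the source of the reverb-like "shadow" of noise after transients.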

My workflow these days tends to be much more transient-focused than in the past. I’ll try to avoid using blocks longer than a fiftieth of a second and adjust the expander-release time to best preserve decays without letting too much “shadow” noise through, while setting the attack time to the minimum setting to preserve transients. What I’m really doing here is letting the FFT window itself determine the attack time, and adjusting the attack indirectly by tuning the FFT size.

Dolby noise reduction

Dolby is a potential problem for restoration. In short, it’s a two-sided “companding” system like that found in many communications systems with poor noise characteristics, but works in the frequency domain as well, applying the most severe companding to high frequencies.

These systems work mainly because natural sounds have a falloff in amplitude, with respect to frequency, that threatens to bury their treble elements once flat noise is added. This is why manufactured cassette tapes are almost universally encoded with Dolby “B”, which you also have to enable on playback to avoid unnaturally grating and insistent treble and generally overcompressed sound.

Dolby is almost certainly designed to take the added tape noise into account, meaning that the encoder and decoder should use slightly different parameter sets. Worse, the thresholds in the Dolby compressor/expander probably make the whole process sensitive to absolute level.

The purist approach would be to do transfers at line level and use a software Dolby decoder, ideally with the ability to output a sampled gain signal for later use – we could then use a regular digital restoration workflow and multiply the gain signal in during mastering. In reality, there don’t seem to be too many of these software decoders about – I couldn’t find any with a quick Google – and it’s unlikely that they support generating a separate gain envelope anyway.

Rather than trying to untie this knot, I’ve always taken the pragmatic approach and done Dolby decoding on the tape deck during the transfer process. This will apply time-varying attenuation to the signal that affects the noise floor and interferes with digital noise reduction later on. Let’s look at this effect in more detail.

During the loudest passages, Dolby is least active because there’s not much headroom to boost signals into, so these sections will have the highest absolute noise levels after decoding. The quiet and signal-free regions, meanwhile, will see Dolby operating at maximum signal boost, minimising the amplitude of tape noise after decoding.

Since we capture noise profiles for digital NR during signal-free sections, a Dolby transfer will give us a conservative measurement that works well with quiet signals and stops working at high levels. This is actually acceptable because higher signal levels tend to correlate with “busy” frequency spectra that mask hiss well.

In short, decoding Dolby during the transfer process will give you quiet parts that are nicely denoised, and loud parts where you can’t hear the noise under all the signal. Given the (lack of) alternatives, I’ll take it.

Archiving

I generally use WinRAR for archiving to CD or DVD – in the early days, it was by far the most popular archiving tool that also had media optimisations, and it has neat data-redundancy features that were useful for protecting CD-ROM archives from the ravages of time and mould. I’d generate a RAR set with the “best” compression setting, spanned over enough small volumes that I could add an extra 10-20% in redundancy, and then write the whole thing off at medium speed (8x for a CD-ROM) because optical-disc writers tended to compromise burn strength if you ran them too fast.

The discs would be laid out with the redundancy volumes placed first: this was an optimisation for reading. Constant-angular-velocity optical drives of the type found in computers achieve their best data rates at the end (outer edge) of the disc, which is also the region most prone to degradation over time. My procedure therefore maximised read speed while storing the redundancy volumes in the more-robust centre. To restore an archive, I’d just put the disc in and directly extract the main RAR set. If any CRC errors occurred, I’d copy everything I could off the disc, including the redundancy volumes, repair the set, and burn a new disc.

Although this archiving process quickly exposed a major weakness in my workstation setup – the shocking quality of much computer memory – that story deserves a post in its own right, so I’ll skip it here.

As time went on, and lossless compression became a mature technology, I discovered David Bryant’s fantastic WavPack tool. Its “fast” mode achieves most of the lossless data reduction possible for a given recording without using any significant CPU power – so much so that I now use it everywhere, including inside heavily CPU-constrained recording projects – and unlike FLAC, it supports all widely-used sample formats including 32-bit integer and floating point. Once I started using WavPack, the only change to the WinRAR part of the workflow was to use it in “store” mode, turning its compression off.