In the past 50 years audio production has moved from analogue to digital, and from expensive recording studios to laptops in bedrooms. Audio production is now far more accessible, but the digital revolution didn’t do much, if anything, to change audio production practices. I say this because in almost every sense, modern digital tools are emulations of their analogue predecessors with comparable capabilities and controls. Our modern DAW gives us much better data management for arranging and editing audio, and we can play almost any virtual instrument at the touch of a button; but it still plays audio via a virtual “tape-head”, and our processing tools are still controlled using parameters taken from old analogue circuitry. During the past 7 years the main goal of my research has been to instigate some real change in this field. A new-style Bakelite knob that gets stuck just like the analogue original is not real innovation, and while I totally understand the desire to capture the characteristics of analogue processors, isn’t it about time that we thought about completely new ways to use our audio technology? Yes…it is.
My primary motivation stems from interactions with amateur music producers who know what they want to do, but who simply don’t have the technical skill to do it. If you don’t like maths or science then it’s hard to understand (and control) a compressor, but it’s not hard to know that it makes the loud bits quieter, or to appreciate why you need to use it to level out your vocal recording. Making tools with controls that the user understands is what I’ve been trying to do, and I’d argue that I’ve had some success along the way! I’ve always called this area of research Perceptual Audio Production, because the goal has been to change the interface between the user and the technology in such a way as to give direct control over perceptual characteristics of the mix. Rather than controlling gain we control loudness, and rather than frequency, gain and Q we control brightness, warmth and presence. I’ll go into this in far more detail later, but for now it’s worth noting that perceptual audio production falls under the umbrella of a broader field, which is most commonly referred to as Intelligent Audio Production, or more succinctly as AI Audio Production.
Before we get into some detail I thought it worth covering a very brief history of the field. The first intelligent tools were developed to automate tasks that were difficult for a human to do. One such task is controlling multiple microphones in a conference setting, e.g. when you have 20 people in a room, each with their own microphone, and a set of loudspeakers to reproduce the sound. If all of the microphones are turned up at the same time you’ll get acoustic feedback, but if you turn them down too low you won’t be able to hear what anyone is saying. The solution is to only turn up those microphones that are in use, but this is tricky for a mixing engineer, who must anticipate who is going to speak before manually selecting and moving the correct fader from a large set. Two solutions were developed in the 70s by Dugan [1,2] and Julstrom and Titchy [3], both of which would automatically turn off those microphones that were not in active use.
The concept of Automatic Mixing was first applied to music by Enrique Perez-Gonzalez while studying for his Ph.D in the Centre for Digital Music (C4DM) at Queen Mary University of London in the mid/late 00s. I was lucky enough to start my Ph.D a year or so after Enrique (also at C4DM), and he was a great source of help and advice in those early years. Enrique developed a number of automatic mixing tools for live music applications [4-9], including automatic gain normalisation and automatic algorithms to control faders, panners and equalisers. Since then research has diverged down numerous paths (including my own work), and we can think of Automatic Audio Production as just one branch of AI Audio Production. However, when it comes to being the catalyst for the entire field of study, Enrique must be given huge credit for his initial inventive steps.
The first commercial venture in this space was directly based upon Enrique’s automatic mixing work (though he was not involved). The company was founded in 2012 by his Ph.D supervisor and Queen Mary University, with backing from a Canadian VC. The company have gone through a number of changes since then, switching their focus to mastering and their name to LandR. A number of other companies have popped up in this space, including one that offers an almost identical service to LandR. I also had my own short-lived enterprise, MixElephant, which started in Jan 2014 with the development of an intelligent multitrack mixing system. Unfortunately this was shut down in June 2014 in favour of joining LandR (I resigned a little over a year later…long story!).
So what’s next for this field? It’s probably worth working on our definition of AI audio production before we think about what might come next.
The most obvious differentiator with traditional tools is the fact that they contain some kind of “intelligence”. But what on earth does this mean?
I’d say that there are two key facets to intelligence in this context. Firstly, intelligent tools are able to take responsibility for some part of our audio production task, and so they reduce the burden on the engineer. Secondly, in order to take on this burden they must be smart enough to adapt to different audio signals and to different situations. To emphasise the point let’s consider a simple audio effect such as an equaliser. An equaliser will change the frequency content of a sound, but the way it changes the sound is entirely set by the audio engineer, and it is not dependent on the input signal. Clearly then, a traditional equaliser is not an intelligent tool. But what about a compressor? Before compressors, the only way to reduce the dynamic range of an audio signal was to manually ride a fader. Clearly this is pretty labour intensive and is prone to human error, so compressors were developed to help [10]. The user still has control over how much the dynamic range is affected, but the compressor takes responsibility for applying a time-varying gain, which is equivalent to our mixing engineer riding the fader. Compressors are also inherently adaptive, because the gain applied depends directly on the level of the input signal. The compressor is therefore an early example of an intelligent audio production tool, and even though this is a fairly low-level example, it goes to show that this “cutting edge” field has been around for a lot longer than many people think.
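To make the “adaptive” part concrete, here’s a minimal sketch of a feed-forward compressor in Python. It isn’t the design of any particular unit, and the parameter names are purely illustrative, but it shows the key idea: the gain applied at each sample is computed from the level of the input signal itself.

```python
import numpy as np

def compress(x, threshold_db=-20.0, ratio=4.0, attack=0.01, release=0.1, fs=44100):
    """Toy feed-forward compressor: the gain it applies at each sample
    depends on the level of the input, which is what makes it 'adaptive'."""
    # One-pole smoothing coefficients for the envelope follower.
    a_att = np.exp(-1.0 / (attack * fs))
    a_rel = np.exp(-1.0 / (release * fs))

    env = 0.0
    out = np.zeros(len(x))
    for n, sample in enumerate(x):
        level = abs(sample)
        # Track the signal level (the automated "fader riding" part).
        coeff = a_att if level > env else a_rel
        env = coeff * env + (1.0 - coeff) * level

        level_db = 20.0 * np.log10(max(env, 1e-9))
        # Gain computer: only reduce gain above the threshold.
        over = max(level_db - threshold_db, 0.0)
        gain_db = -over * (1.0 - 1.0 / ratio)
        out[n] = sample * 10.0 ** (gain_db / 20.0)
    return out
```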
Before going into more detail I’ll diverge a little and discuss what happens when mixing a song. You will sit in front of your DAW, listen to your song, build up a picture of how the mix should sound, then work out how to make it sound that way. Sure, your picture may evolve over time, and you might have a sudden insight that changes your plans, but in general you’ll have an objective in mind and will be playing with your audio tools to make your mix meet that objective. The only alternative is that you randomly change the controls and wait (and hope) for it to sound good. In order to understand intelligent audio tools we need to break down the workflow into two parts: (i) deciding how our mix should sound (working out our objective), and (ii) making it sound that way (hitting our objective). I completely agree that these two aspects may rotate, swap around, iterate, and interact, but hopefully you can agree that they are separable…at least conceptually! A second way to visualise this separation is to think of a typical scenario where a producer and a mixing engineer are working in a studio. The producer will be issuing requests: make the bass sound fatter, the snare needs to be clearer, I can’t understand the vocals etc etc, and it’s then the mixing engineer’s job to make this happen. The producer doesn’t really care how it’s achieved so long as it’s done, so there is a clear separation of two roles: (i) the producer sets some mixing objectives, and (ii) the mixing engineer manipulates the audio processing tools to meet those objectives.
An AI audio production tool provides this same separation. It allows the user to set the objective of a music production task, and its internal algorithm is responsible for making sure that the objective is met. With a compressor the user is setting objectives related to the dynamic range, but it is the innards of the tool that work out what gain to apply on a sample-by-sample basis to meet that objective. This, as stated above, is a low-level task, but we can apply the same principle to any part of music production, and tools have been made that allow the user to set higher-level objectives relating to: the loudness balance of a mix, the amount of masking between tracks in a mix, the perceived strength of a noise gate, timbre features of musical mixtures, and many many more. I see no limit on what can be tackled with this approach and would argue that so long as we can describe and model an audio production concept, then we can control it using an intelligent tool.
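To give a flavour of one of those higher-level objectives, here’s a toy loudness-balance sketch: the user states how loud each track should sit relative to the first one, and the tool computes the fader gains needed to get there. RMS level is standing in for a proper loudness model, and all the names here are mine rather than anything from a published system.

```python
import numpy as np

def gains_for_loudness_balance(tracks, target_offsets_db):
    """Given multitrack audio and a desired loudness offset (in dB) for each
    track relative to track 0, return per-track linear gains that meet that
    balance. RMS level stands in for a perceptual loudness model."""
    rms_db = [20.0 * np.log10(np.sqrt(np.mean(t ** 2)) + 1e-12) for t in tracks]
    # Gain needed so each track sits target_offsets_db[i] above/below track 0.
    return [10.0 ** ((rms_db[0] + off - level) / 20.0)
            for level, off in zip(rms_db, target_offsets_db)]
```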
The basic components of an AI audio production tool are shown below. We have a sound feature model that gives us a way to describe the objective of our music production task, and we have a control algorithm that works out how to set the audio processors to meet our objective. I’ll expand on this shortly.
Let’s look at this picture in a slightly different way as below, ignoring the audio processors which are hopefully familiar to all. I’ve added the user input to our intelligent tool, which as discussed above is analogous to the producer in a studio. The producer provides some objective, e.g. make the guitar brighter, and it is the mixing engineer’s job to do so. In the real-life situation both the producer and the engineer understand what is meant by the term brighter, i.e. they have a mental model that can provide a quantitative assessment of the brightness of the guitar. If we want to make intelligent audio production tools we need equivalent models of the sound features we use within our objectives. If we want to set a brightness objective we need a model of brightness, if we want to set a loudness objective we need a model of loudness, if we want to set an objective on the vocal clarity then we need a model of vocal clarity etc etc. If you can model it, then you can make an intelligent tool to control it…so long as you have a good control algorithm!
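As a concrete (if crude) example of such a sound feature model, brightness is commonly approximated by the spectral centroid. The sketch below is a big simplification of the perceptual models I’ll discuss elsewhere, but it’s enough to give our intelligent tool a number to aim for.

```python
import numpy as np

def spectral_centroid(x, fs=44100):
    """Crude brightness proxy: the amplitude-weighted mean frequency of the
    signal's spectrum, in Hz. A higher centroid roughly means a brighter sound."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
```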
The engineer has received a specific music production objective, so how does he go about fulfilling it? Firstly he’ll use his knowledge of audio production to select those parameters in his audio processors that provide most control over the sound feature he needs to change. For the “make brighter” objective he’ll probably choose a high-ish band on his favourite equaliser, and then experiment with the gain until he’s happy that he’s added enough brightness. This is exactly the process that we need to replicate with our control algorithm, i.e. we need to develop an algorithm that will select a band on our equaliser, and then manipulate its gain to increase the brightness to the desired level. For all but the most trivial of cases this will be done using some kind of numerical optimisation algorithm, which will be covered in detail elsewhere on this site.
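A toy version of that control algorithm might look like the sketch below: a single high-frequency band (the hypothetical `boost_highs` helper, plus the `spectral_centroid` model from the previous sketch) whose gain is found by numerical optimisation so that the processed guitar lands on the requested brightness. A real system would use a proper EQ and a proper brightness model, but the structure is the same.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.signal import butter, lfilter

def boost_highs(x, gain_db, cutoff_hz=3000.0, fs=44100):
    """Hypothetical one-band 'brightness' control: add a gain-scaled
    high-passed copy of the signal back onto the original."""
    b, a = butter(2, cutoff_hz / (fs / 2), btype="high")
    highs = lfilter(b, a, x)
    g = 10.0 ** (gain_db / 20.0)
    return x + (g - 1.0) * highs

def match_brightness(x, target_centroid_hz, fs=44100):
    """Search for the band gain that brings the spectral centroid
    (our stand-in brightness model) to the requested target."""
    def error(gain_db):
        # spectral_centroid() is the brightness proxy from the earlier sketch.
        processed = boost_highs(x, gain_db, fs=fs)
        return (spectral_centroid(processed, fs) - target_centroid_hz) ** 2

    result = minimize_scalar(error, bounds=(-12.0, 12.0), method="bounded")
    return result.x  # gain in dB that best meets the brightness objective
```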
To summarise, we can think of the user as providing an intelligent audio production tool with a music production objective, described in terms of a sound feature, e.g. make the guitar “this” bright. The tool contains a model of this sound feature, and via a control algorithm it works out how to set the parameters on the audio processor to make the guitar “that” bright. Simple as that!
So where does automatic audio production lie within the broader field of AI audio production?
The additional component that distinguishes automatic audio production from other types of AI audio production is a means to define the objectives that would normally be set by the producer (or user). In other words, we have an additional model that looks at the incoming signals and automatically works out suitable music production objectives. There are many possible ways to do this, and I’ll discuss the specifics elsewhere on this site. However, it’s fair to say that non-trivial approaches employ machine learning techniques to predict these objectives, e.g. the system will analyse an aggressive EDM song and decide that, based on all other songs in this genre, it should be ridiculously loud.
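As a deliberately simple illustration of that extra model, the sketch below predicts a target loudness for a new song by averaging the loudness of its nearest neighbours in some feature space, i.e. the “other songs in this genre”. The feature extraction and reference data are left entirely hypothetical.

```python
import numpy as np

def predict_target_loudness(song_features, reference_features, reference_loudness, k=5):
    """Toy objective predictor: find the k reference songs closest to this one
    in feature space and use their average loudness as the production objective."""
    reference_features = np.asarray(reference_features)
    reference_loudness = np.asarray(reference_loudness)
    dists = np.linalg.norm(reference_features - song_features, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(reference_loudness[nearest]))
```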
So is automatic audio production a good idea?
I think that in the right circumstances it makes sense, e.g. if I wanted to remix or remaster a huge back catalogue of material, or if I was developing a system where the audio was secondary to some other component, e.g. in gaming or in an automatic music composition system. But for general audio production, and in particular music, I’m not a huge fan. It’s not that I don’t believe they can do an OK job, because I’m sure that someday they will (although I think we’re a long way from a pro-standard fully automatic system). My main reason is a belief that most people enjoy the music production process, and rather than being a means to an end it is part of the appeal in itself. To be creative in this way is fun, and I’d rather facilitate this process than wipe it out completely. Perhaps if I wanted to knock up a quick demo then an automatic first draft would be useful, but there are better and smarter ways to achieve this with more general intelligent tools, e.g. by having personal presets that you develop yourself over time. It’s also true that making a fully automatic system is more difficult, and I’d argue that a lot of the energy that goes into making it this way is wasted. For example, why waste time making a genre recognition system when it would take the user 5 seconds to provide this information himself? Sure, it means you can shout about how great your machine learning is, but in terms of user experience and system complexity the benefits don’t outweigh the costs.