A really compelling narrative running underneath all the other news of the world right now is the way machine learning and neural networks are transforming data interpretation. And really, transforming everything. The core concept of machine learning, a term coined over fifty years ago, is a process which “gives computers the ability to learn without being explicitly programmed.”
It’s done by feeding an algorithm (of which there are many types) mountains of data and watching it gradually start to sort that data based on the commonalities it observes. We can steer this process (called “supervised” machine learning) by approving or rejecting some of its categorizations, guiding the algorithm’s successive patterning and thereby curating the results.
I’ve been wondering lately what this sort of thing might eventually do for the field of sound design.
How It Works
We take a powerful computing system (you can even spin up an AWS cluster to do this) and pick an algorithm. Right now, this instance is tabula rasa, a newborn with no concept of anything in the world. So we start to show it things.
Perhaps we feed it tons of pictures of cars and it starts to zero in on visual information about the landscape behind each one: reject! In this case, we’re specifically hoping for it to discover cars, so we course-correct it. Back to the drawing board: the algorithm now knows one thing it doesn’t really care about.
Perhaps it groups a bunch of shots of what we see to be orange cars, saying, “hey, I think there’s something here.” We’d tell it it’s discovered “car” and “orange” and send it back to work. We keep labeling its attempts as it gradually tightens its definitions of the world we’ve shown it.
Eventually, you end up with a system which is exceedingly good at telling you which images feature cars and which don’t. There are many more things we can do with it, not the most important of which is assisting sound design.
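To make that loop concrete, here’s a minimal sketch in Python using scikit-learn. Everything in it is a stand-in: the “image features” are random numbers and the “human labels” are generated by a rule, because the point is only the shape of the process, show it labeled examples, let it adjust, then ask it about things it has never seen.

```python
# A minimal sketch of supervised classification, not anyone's production
# pipeline. The features and labels below are synthetic stand-ins for
# real images and real human approvals/rejections.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))          # stand-in for per-image feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for human "car" / "not car" labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training is the course-correction loop: the model keeps adjusting itself
# until its guesses line up with the labels we approved.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Afterwards, it can label pictures it has never seen before.
print("accuracy on unseen images:", clf.score(X_test, y_test))
```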
What It Could Mean for Sound
Machine learning operates on large quantities of data, and the first place any sound designer will recognize that is the sound effects database. This database is often a two-way channel: we use verbal abstractions, descriptors like “wind,” “low,” “impact,” “dog,” to label sounds according to the qualities they have; we use the same sorts of strings to search for and retrieve those files.
If we can get our algorithm conscious of these descriptors, we can use it to affect both these stages of the workflow. For the algorithm to deal in these descriptors, it needs to be able to process sound the way we do. Warning, loosely technical explanations ahead.
We’re interested in putting soft labels on characteristic combinations of a bunch of sonic qualities–duration, amplitude, frequency, timbre, etc.–so we need the machine to be able to analyze those qualities of the sound. While a computer can’t ‘listen’ in the way that we do, that is, make value judgments on an acoustic waveform, it can perform lots of useful math on a digital audio signal. Using FFTs to bring audio into the frequency domain for analysis opens up even more ways for an algorithm to find patterns and start to sort our data out.
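As a loose illustration of what that useful math might look like (emphatically a sketch, with feature choices that are just my own picks, not a prescribed recipe), here are a few lines of Python that use numpy’s FFT to boil a mono clip down to numbers an algorithm could start sorting on:

```python
# A rough sketch of pulling a few descriptive numbers out of a mono audio
# buffer, using only numpy. RMS level and spectral centroid are illustrative
# choices; a real system would compute many more features than this.
import numpy as np

def describe(signal: np.ndarray, sample_rate: int) -> dict:
    """Return a handful of numbers a sorting algorithm could work with."""
    # Amplitude: root-mean-square level of the whole clip.
    rms = float(np.sqrt(np.mean(signal ** 2)))

    # Frequency content: magnitude spectrum via the real FFT.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

    # Spectral centroid: a crude "brightness" measure (is it low or high?).
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))

    return {
        "duration_s": len(signal) / sample_rate,
        "rms": rms,
        "spectral_centroid_hz": centroid,
    }

# Example: a 1-second 440 Hz tone should report a centroid near 440 Hz.
sr = 44100
t = np.linspace(0, 1, sr, endpoint=False)
print(describe(0.5 * np.sin(2 * np.pi * 440 * t), sr))
```

Stack a handful of numbers like these per file and you have the raw material a learning algorithm can start grouping.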
Jump ahead and realize that 1) Shazam has been a thing for years now and 2) we’re already using FFTs to determine song genres with a very high rate of confidence, and all this starts to seem possible. Here are some basic possibilities:
Envision an algorithm poring over your library and your collection of personal recordings and beginning to identify commonalities. You can ‘tag’ buckets as it finds them, send it back to work and see if new and even more interesting subcategories emerge. It might find “loud” and “soft” sounds first; you’re not interested in that distinction, because level is so easily manipulated afterwards with gain that it isn’t a useful way to classify. Sounds with a persistent noiseprint throughout them, though, might be useful to classify. It starts finding low and high sounds. Different frequency bands. Sounds that start loud and immediately get quiet. Maybe these are impacts? You pick a handful out of there that you really like, identify them as having “good transients” and tell it to go find more that might sound like those. I really don’t know whether this would work, but I suspect that it could.
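A toy version of that find-buckets-then-tag-them loop might look something like this, assuming every sound in the library has already been reduced to a feature vector (random numbers stand in for a real library here, and k-means is just one convenient stand-in for whatever does the grouping):

```python
# A toy sketch of the "find buckets, then tag them" loop. The library
# features are synthetic placeholders; the tags dictionary is illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical library: 300 sounds, each summarized by 8 features.
library_features = rng.normal(size=(300, 8))

# Let the machine propose groupings with no labels at all.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=1)
cluster_ids = kmeans.fit_predict(library_features)

# The human step: audition a few members of each cluster and tag the bucket,
# or reject it and re-cluster with different features or a different count.
tags = {0: "good transients", 1: "steady winds"}  # illustrative only
for cid, tag in tags.items():
    members = np.flatnonzero(cluster_ids == cid)
    print(f"cluster {cid} ({tag}): {len(members)} sounds")
```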
Eventually, you’ve turned your search process into one where you’ve tuned your machine to understand sounds the same way you do, using your same terminology, because you’ve trained it since inception. It knows you. When you say, “I need a good, punchy sub hit here, similar to a kick drum, but organic,” or, “I need a long wind that’s really steady without a lot of crispy textures,” it can pattern match to find those things.
You could also use it to search and match brand new audio information: if you create brand new source in a sampler, you might have the algorithm automatically label it for you (based on how you’ve labeled sounds with similar characteristics) and go find more of it. Imagine never having to label a file again.
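The “go find more like it” step could be as simple as a nearest-neighbor lookup over those same feature vectors; the library, its tags, and the new recording below are all placeholders:

```python
# A minimal sketch of "find me more like this": return the library entries
# whose features sit closest to the new sound's, then suggest their tags.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
library_features = rng.normal(size=(300, 8))        # hypothetical library
library_tags = [f"sound_{i}" for i in range(300)]   # whatever you've tagged

index = NearestNeighbors(n_neighbors=5).fit(library_features)

# A brand-new recording you just made in a sampler (placeholder features):
new_sound = rng.normal(size=(1, 8))
distances, neighbors = index.kneighbors(new_sound)

# Suggest labels by looking at what the closest existing sounds are called.
print("closest matches:", [library_tags[i] for i in neighbors[0]])
```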
What Might Be Next
This is all great for pinpoint sound search and retrieval, but what about encouraging experimentation, accidental discovery, and mistakes? Would you run a query for something, say, “metal,” and slacken the accuracy percentage to let it pull in something totally unexpected? Could the algorithm eventually come to know your mind well enough that when you tell it what you’re going for with a mood, but without any specific sound characteristics in mind, it could draw those connections, too? If you asked it for “scary” sounds, what would you get back? Could you feed it mixed pieces and have it identify characteristics present in them at various points on the timeline? What could it tell you about them?
What are some of the ways this process could dampen or limit creativity? Where do human creativity and ‘inspiration’ begin to emerge as separate from this model as it gets increasingly accurate?
I would love to hear your thoughts on the ways machine learning might accelerate and enhance the work that we do. Our species has a poor track record of ever turning back from the frontier of technological progress, so I think it’s a safe bet that this stuff is coming to sweep even a field as creative as ours, for there’s lots in it that, when distilled down, is fairly uncreative. This has always been the case; it’s just happening more quickly.
How far could this go, and what parts of the process will we want to have kept for ourselves?
(How long will we have to enjoy the peak of computer-assisted operation before it’s all shaken by an artificial super intelligence and the game changes completely? Sorry…)
—
Addendum: What’s Out There Now & Additional Reading
Thanks to the rest of the Designing Sound staff for making me aware of a few options in the field that are doing work on this front right now:
- SoundTorch launched many years ago (though it seems still to be in Beta) with a really captivating teaser video. Development looks like it’s slowed to a halt, but I’m sure whatever parts of it are unpatented will be reborn in software before too long.
- Tsugi’s Dataspace looks INCREDIBLY cool and like it executes on a ton of my wanderings in this post–but it isn’t available publicly just yet. I am eagerly awaiting a release! If you’ve had a chance to use it, get in touch with us. I’d love to hear about your experiences.
I highly recommend this TED talk, which gives a better primer on the basics of machine learning than I have above:
As well as The Road to Artificial Super Intelligence, a series of essays exploring the endgame we’ll have gotten ourselves into before too long.
Rob Esler says
Really nice article.
I think a good question to ask is — does machine learning solve a problem for sound designers? I recall this news release:
http://news.mit.edu/2016/artificial-intelligence-produces-realistic-sounds-0613#.V1-OxQWangg.facebook
Could this tool solve a design problem?
Perhaps there are technical solutions like what LANDR is doing with mastering: https://www.landr.com/en. The technical side of design is probably where this innovation is more likely to be used, since it is less subjective. For example, if Pro Tools could learn the patterns in which I edit audio and eventually just trim and fade automatically, or learn how I use other effects and plugins, this could perhaps speed up certain design processes.
But at the end of the day I’ll keep the machines at arm’s reach from my creative process.
Luca Fusi says
Great links, Rob. I would be totally happy to surrender some of the less glamorous, incidental audio in a given edit (e.g. foliage brushing or debris in an action sequence) to an algorithm that’s learned to put one over on the majority of listeners if it buys me more time to make the cool stuff sound cooler. That’s not to say that there isn’t craft in all types of sound editing, or that you’d want to blanket hand over a type of sound in all contexts; in a delicate close-up, I would rather handle the cloth track myself.
But you can see that there’s already some enthusiasm in the game community for bringing up synthesized variation / sound creation of the type you’re mentioning, with modal synthesis methods such as Wwise’s SoundSeed Impact, little-used as it still may be. Wouldn’t doubt that Rockstar’s RAGE engine is capable of lots of the same sort of thing. We’re already pretty close to being able to feed those synthesis models the parameters they need to match object type and scale; it just seems like the program you linked is able to do it from the ground up.
It’s tough to say where that process stops, though, and where we decide to draw the line on employing machine learning-driven sound synthesis / editing. Crazy to think that’s already a few clicks past what I was envisioning in this article, and it’s being written about today. Crazier still that all the funding driving it doesn’t seem to be coming from some of the entertainment industries it could come to affect the most. I’m guessing there’ll be lots of initial resistance to methods like this, but a new generation will bring them into the workflow piece by piece.
Similarly, LANDR strikes me as an Ozone-esque solution which has access to mountains of data and can tune itself, but Ozone already came packed with presets that were / are ‘good enough’ for most folks who don’t fully understand what they’re doing. The introduction of both tools arguably raises the overall music production quality bar without having improved creativity across the field; still, that’s a net win. Perhaps the future of our education moves more strictly towards fostering creativity and all that which makes us human.