A really compelling narrative running underneath all the other news of the world right now is the ways in which machine learning and neural networks are transforming data interpretation. And really, transforming everything. The core concept of machine learning, coined over fifty years ago now, is a process which “gives computers the ability to learn without being explicitly programmed.”
It’s done by feeding an algorithm (of which there are many types) mountains of data and watching it gradually start to sort that data based on the commonalities it observes. We can steer this process (called “supervised” machine learning) by approving or rejecting certain of its categorizations, guiding the algorithm’s successive patterning and thereby curating the results.
I’ve been wondering lately what this sort of thing might eventually do for the field of sound design.
How It Works
We take a powerful computing system (you can even spin up an AWS cluster to do this) and pick an algorithm. Right now, this instance is tabula rasa, a newborn with no concept of anything in the world. So we start to show it things.
Perhaps we feed it tons of pictures of cars and it starts to zero in on visual information about the landscape behind each one in the picture: reject! In this case, we are specifically hoping for it to discover cars instead, so, we course correct it. Back to the drawing board: the algorithm now knows one thing it doesn’t really care about.
Perhaps it groups a bunch of shots of what we see to be orange cars, saying, “hey, I think there’s something here.” We’d tell it it’s discovered “car” and “orange” and send it back to work. We keep labeling its attempts as it gradually tightens its definitions of the world we’ve shown it.
Eventually, you end up with a system which is exceedingly good at telling you which images feature cars and which don’t. There are many more things we can do with it, not the least of which is assisting sound design.
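The correct-and-repeat loop above can be sketched in a few lines. This is a toy stand-in, not a real vision system: the “images” are invented two-number feature vectors (imagine “has wheels,” “is orange”), and the classifier is a simple nearest-centroid rule built from the labels we’ve confirmed.

```python
import numpy as np

# Invented feature vectors standing in for labeled training examples.
# After we approve/reject the algorithm's groupings, each label keeps a
# centroid (the average of its confirmed examples).
rng = np.random.default_rng(0)
cars     = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
not_cars = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))

centroids = {"car": cars.mean(axis=0), "not car": not_cars.mean(axis=0)}

def classify(x):
    """Assign a new example to the nearest labeled centroid."""
    return min(centroids, key=lambda lbl: np.linalg.norm(x - centroids[lbl]))

print(classify(np.array([1.9, 2.1])))   # -> car
print(classify(np.array([0.1, -0.1])))  # -> not car
```

Every round of human correction effectively nudges those centroids, which is the “tightening its definitions” described above.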
What It Could Mean for Sound
Machine learning operates on large quantities of data, and the first place any sound designer can identify with that is with the concept of a sound effects database. This database is often a two-way channel: we use verbal abstractions, descriptors like “wind,” “low,” “impact,” “dog” to label sounds according to what qualities they have; we use the same sorts of strings to search for and retrieve these files.
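That two-way channel can be made concrete with a tiny sketch. The filenames and descriptor tags below are invented for illustration; the point is only that the same strings serve as labels going in and as queries coming out.

```python
# Hypothetical sound library: descriptor strings label files on the way
# in, and the same strings retrieve them on the way out.
library = {
    "wind_low_steady_01.wav": {"wind", "low", "steady"},
    "metal_impact_bright_03.wav": {"metal", "impact", "bright"},
    "dog_bark_close_02.wav": {"dog", "bark", "close"},
}

def search(query_tags):
    """Return files whose tag sets contain every query descriptor."""
    wanted = set(query_tags)
    return sorted(f for f, tags in library.items() if wanted <= tags)

print(search(["wind", "low"]))  # -> ['wind_low_steady_01.wav']
```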
If we can get our algorithm conscious of these descriptors, we can use it to affect both these stages of the workflow. For the algorithm to deal in these descriptors, it needs to be able to process sound the way we do. Warning, loosely technical explanations ahead.
We’re interested in putting soft labels on characteristic combinations of a bunch of sonic qualities–duration, amplitude, frequency, timbre, etc.–so we need the machine to be able to analyze those qualities of the sound. While a computer can’t ‘listen’ in the way that we do, that is, make value judgments on an acoustic waveform, it can perform lots of useful math on a digital audio signal. Using FFTs to bring audio into the frequency domain for analysis opens up even more ways for an algorithm to find patterns and start to sort our data out.
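Here’s a minimal sketch of that frequency-domain step: synthesize one second of a 440 Hz sine, take an FFT, and read off the dominant frequency. A real system would compute far richer features (spectral centroid, noise floor, timbre descriptors) over many short windows, but this is the kernel of the idea.

```python
import numpy as np

sr = 8000                               # sample rate, Hz
t = np.arange(sr) / sr                  # one second of time stamps
signal = np.sin(2 * np.pi * 440 * t)    # pure 440 Hz tone

spectrum = np.abs(np.fft.rfft(signal))  # magnitude spectrum
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
dominant = freqs[np.argmax(spectrum)]   # frequency of the tallest peak

print(dominant)  # -> 440.0
```

Numbers like that dominant frequency, stacked up across a whole library, are exactly the kind of data an algorithm can start sorting.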
Jump ahead and realize that 1) Shazam has been a thing for years now, and 2) we’re already using FFTs to determine song genres with a very high rate of confidence, and all this starts to seem possible. Here are some basic possibilities:
Envision an algorithm poring over your library and your collection of personal recordings and beginning to identify commonalities. You can ‘tag’ buckets as it finds them, send it back to work, and see if new and even more interesting subcategories emerge. It might find “loud” and “soft” sounds first; you’re not interested in that distinction, because gain is so easily manipulated after the fact that it isn’t a useful classifier. Sounds with a persistent noiseprint throughout them, though, might be useful to classify. It starts finding low and high sounds. Different frequency bands. Sounds that start loud and immediately get quiet. Maybe these are impacts? You pick a handful out of there that you really like, identify them as having “good transients,” and tell it to go find more that might sound like those. I really don’t know how this works, but I suspect that it could.
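One way that bucket-finding pass could work is plain clustering. In this hypothetical sketch, each file is reduced to a couple of invented summary features (say, mean level and brightness), and a bare-bones k-means groups them into buckets we could then tag as “impacts” versus “steady beds.” The features, groupings, and the choice of k-means are all illustrative assumptions, not anything from a real product.

```python
import numpy as np

# Invented [mean level, brightness] features for 20 hypothetical files.
rng = np.random.default_rng(1)
impacts = rng.normal([0.9, 0.8], 0.05, size=(10, 2))  # loud, bright
beds    = rng.normal([0.3, 0.2], 0.05, size=(10, 2))  # quiet, dark
X = np.vstack([impacts, beds])

def kmeans(X, init_idx, iters=10):
    """Tiny k-means: assign to nearest center, recompute centers, repeat."""
    centers = X[init_idx].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels

# Seed one starting center inside each true group so neither empties out.
labels = kmeans(X, [0, 19])
print(labels)  # the two groups separate cleanly into two buckets
```

The buckets come back unnamed; the human step is looking inside each one and deciding it deserves a tag.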
Eventually, you’ve turned your search process into one where you’ve tuned your machine to understand sounds the same way you do, using your same terminology, because you’ve trained it since inception. It knows you. When you say, “I need a good, punchy sub hit here, similar to a kick drum, but organic,” or, “I need a long wind that’s really steady without a lot of crispy textures,” it can pattern match to find those things.
You could also use it to search and match given brand new audio information–if you create brand new source in a sampler, you might have the algorithm automatically label it for you (based on what it knows you’ve labeled similar sound characteristics) and then go and find more of it. Imagine never having to label a file again.
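That auto-labeling step might look something like a nearest-neighbor lookup: compute the new sound’s features, find the closest already-labeled sound, and inherit its tags. The feature vectors, filenames, and tags below are invented placeholders.

```python
import numpy as np

# Hypothetical already-labeled library: name -> (feature vector, tags).
labeled = {
    "kick_organic_01": (np.array([0.9, 0.1]), {"sub", "punchy", "organic"}),
    "wind_bed_04":     (np.array([0.2, 0.8]), {"wind", "steady"}),
}

def auto_label(features):
    """Copy tags from the nearest labeled neighbor in feature space."""
    name = min(labeled, key=lambda n: np.linalg.norm(features - labeled[n][0]))
    return labeled[name][1]

# A brand-new sampler patch whose features land near the kick drum:
print(sorted(auto_label(np.array([0.85, 0.15]))))  # -> ['organic', 'punchy', 'sub']
```

A real system would want a confidence threshold before committing tags, so an oddball sound gets flagged for a human instead of mislabeled.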
What Might Be Next
This is all great for pinpoint sound search and retrieval, but what about encouraging experimentation, accidental discovery, and mistakes? Would you run a query for something, say, “metal,” and slacken the accuracy percentage to let it pull in something totally unexpected? Could the algorithm eventually come to know your mind well enough that when you tell it what you’re going for with a mood–but without any specific sound characteristics in mind–it could draw those connections, too? If you asked it for “scary” sounds, what would you get back? Could you feed it mixed pieces and have it identify characteristics present at various points on the timeline? What could it tell you about them?
What are some of the ways this process could dampen or limit creativity? Where do human creativity and ‘inspiration’ begin to emerge as separate from this model as it gets increasingly accurate?
I would love to hear your thoughts on the ways machine learning might accelerate and enhance the work that we do. Our species has a poor track record of ever turning back from the frontier of technological progress, so I think it’s a safe bet that this stuff is coming to sweep even a field as creative as ours–for there’s a lot in it that, when distilled down, is fairly uncreative. This has always been the case; it’s just happening more quickly.
How far could this go, and what parts of the process will we want to have kept for ourselves?
(How long will we have to enjoy the peak of computer-assisted operation before it’s all shaken by an artificial super intelligence and the game changes completely? Sorry…)
Addendum: What’s Out There Now & Additional Reading
Thanks to the rest of the Designing Sound staff for making me aware of a few options in the field that are doing work on this front right now:
- SoundTorch launched many years ago (though it seems still to be in Beta) with a really captivating teaser video. Development looks like it’s slowed to a halt, but I’m sure whatever parts of it are unpatented will be reborn in software before too long.
- Tsugi’s Dataspace looks INCREDIBLY cool and like it executes on a ton of my wanderings in this post–but it isn’t available publicly just yet. I am eagerly awaiting a release! If you’ve had a chance to use it, get in touch with us. I’d love to hear about your experiences.
I highly recommend this TED talk, which gives a better primer on the basics of machine learning than I have above:
As well as The Road to Artificial Super Intelligence, a series of essays exploring the endgame we’ll have gotten ourselves into before too long.