Procedural Audio: An Interview with Nicolas Fournel
Nicolas Fournel has had a long career in game audio and has been at the forefront of research and development of audio tools. He maintains procedural-audio.com, a repository of procedural audio information, as well as the recently created Procedural Audio Interest Group. He was more than happy to add his voice to the procedural audio series, drawing on his varied experience developing such tools in the ‘field’.
DS: Your work has taken you all over the world with companies like EA, Konami and Sony. Tell us about it.
NF: I guess the professional “game audio” aspect of my career really started in the early 90’s. I was developing commercial audio software on Amiga and PC and later created a company in Paris that specialized in audio synthesis. One of the products was Virtual Waves, a software modular synthesizer which included many types of synthesis, processing and analysis modules. That was long before Reaktor (or Generator, as it was first called) and that kind of software. Instead of coming with instrument sounds, Virtual Waves shipped with a lot of patches generating sound effects, so many of our clients were French game studios, and supporting them was really interesting. In a sense, it was already “procedural audio” for games. Besides that, at the time I was also building alternative controllers like the Semekrys, an instrument built from four large glass touch-screens mounted on a big plexiglass structure.
Anyway, in 2000 there was an opportunity to work on the audio of the forthcoming GameCube console, so I moved to San Francisco to join Factor 5. They were located on the same road as Skywalker Ranch and worked closely with LucasArts. They were famous for Rogue Squadron on the N64 (and before that, Turrican, etc.), but they were also very involved in the audio system of the future GameCube. I worked on MusyX (the audio tool / run-time shipped by Nintendo with the GameCube devkits). I also designed a couple of resampling filters for the ROM of the GameCube’s DSP itself, as well as some other weird things like a very crude HRTF system. Because there was so little space left in the ROM, it was pretty rough, and I think it never got used. In the end, we had something like 10 bytes left in the ROM and we just put our initials in there…
After 5 years working on more Star Wars games and porting MusyX to the Xbox, I moved to Konami. The studio was in Hawaii, on the top floor of a building overlooking Waikiki Beach. I remember some interesting exchanges with the local HR guy when I interviewed there, like “please always wear a shirt at work” (meaning don’t come back half-naked from the beach), or “do you think you will ever get island fever?” (meaning feel trapped, because you can drive around the whole island in about two and a half hours). They had been desperately looking for an audio programmer for a while. The situation was quite urgent, with several games in production and no audio tech, so I was given “carte blanche”, which was really great but also totally crazy: after just a couple of months I found myself responsible for two new audio tool chains and engines (one for the DS, one for all the other platforms), and implementing audio in 3 or 4 games at a time. However, it was a great experience because I was embedded in the audio department and was therefore really able to support the sound designers.
Unfortunately Konami eventually closed the Hawaii studio (I would probably still be there otherwise!). At that point I joined EA Tech, the central technology group of Electronic Arts, located in Burnaby (Vancouver). At the time, 90% or more of EA’s games were using audio technology from that group. During the interview they showed me their brand new campus, which was awesome, but EA Tech was actually located outside the campus at the time, a mile away in an older building which belonged to IBM. So I ended up there with roughly 60 other programmers and no artists. Very smart people obviously, but not the creative atmosphere I was looking for, and the change of climate was quite drastic after Hawaii, so after about two years I moved to Sony Computer Entertainment in London.
Being Sony, it obviously had a lot to offer, from interesting game concepts to new controllers or platforms. More importantly, I was part of the Creative Services Group, which gave me the opportunity to work directly with artists once again. Also, CSG is a central service group within SCE Worldwide Studios, so it was great to be in touch with all the teams in the various studios, and even with Sony US and Japan. I really had a great time there, working on technology with brilliant people and pushing for more use of procedural audio and audio feature extraction in games and tools. Some of the work I did there ended up being patented. I was also able to talk about these topics at many conferences. Advocating for procedural audio and sharing my work with people in the industry has been one of the most rewarding things I have done in my career. I strongly encourage people to do it!
DS: You have been in the industry for a considerably long time! Has the mindset of developers changed over the years regarding tools that might make processes easier, if not revolutionize them?
NF: I’m not sure the mindset has fundamentally changed, but the circumstances have. Twenty-five years ago, you could make your game alone in your bedroom, including all the programming and all the art. You could hack everything together quickly, enter the sprites directly in an array in binary, and be done with it. So you didn’t really need tools. With games becoming more and more complex, and with the exponential growth in the number of assets required to populate bigger and bigger virtual worlds, tools have become indispensable. So I think it has been more of a logical evolution. Once you are confronted with a very cumbersome or repetitive task, you will naturally try to find ways to make your job easier.
However, in some cases these tools maybe didn’t evolve as much as they could have. Some of them are still little more than glorified Excel sheets. Historically, tools in the game industry have been built more to package assets conveniently or to help organize data in scenes or levels, rather than to empower artists and help them create something truly adapted to an interactive medium. The creation part has traditionally been left to other, more traditional, software providers, and only recently have they started to provide specific functions targeted at game creators. Also, being a tool programmer in the game industry has always been far less glamorous than being an engine programmer, coding close to the metal and able to show his or her contribution to the game directly on the screen. With bigger teams and the industry going mainstream, though, we now have developers coming from all backgrounds, and some of them are truly passionate about building good tools.
The reason I liked my experiences at Konami and Sony is that I was immersed in the audio or creative departments, and not with a bunch of other programmers. If you really want to create the best tools you must be sitting in the middle of the people who need them and will be using them. It’s not so much about asking them what they want, because invariably they will ask for what they know: minor updates to the tools they have, stuff they have seen in another studio or in a middleware. It’s more about “living” with them, looking at their workflows and bringing new solutions. Nobody can expect a sound designer to know everything that is possible to achieve with some advanced audio feature extraction method, or with artificial intelligence algorithms adapted to audio, even if the new generation of technical sound designers is much more up to date on that nowadays. Too often, unfortunately, the audio engineer on a team is an overworked programmer in charge of many other things, and sitting on another floor.
I think tools make such a huge difference in the game development process that it should be one of your questions to the team when interviewing for a new position. Looking at the tool chain is pretty much having a peek into the future and seeing how production will unfold. Show me your tools and I will tell you how long you will crunch…
DS: procedural-audio.com, the website you maintain, is a great collection of papers and information on the subject. In your opinion, what are the strengths and weaknesses of procedural techniques, both from the perspective of a technologist and from that of a sound designer?
NF: Well, I guess everybody will agree that the main benefits of procedural audio are providing almost infinite variations of the generated sound effects and reducing the memory footprint while doing so (since we are using a model to generate the sound data at run-time instead of storing the assets themselves). If you take a wind sound effect, for example, a typical implementation would use long samples to avoid the sensation of looping, mixing them somewhat randomly or based on the weather in the game. This will either use a lot of memory or eat one or more streams. But you could synthesize a very convincing wind simply by sending a white noise generator through a bank of resonant bandpass filters, which basically simulate the obstacles in the path of the wind. In that case you just need to store a few coefficients for the filters, and update them in real-time to create variations in the wind. Suddenly we are talking about only a few dozen bytes, for something far more flexible. This is what SoundSeed Air from Audiokinetic has been doing for a while, for example, and even how we created wind sounds on the old analog synthesizers.
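To make the wind example concrete, here is a minimal Python sketch of the idea, assuming a standard RBJ-cookbook band-pass biquad. The band frequencies, Q values, gust rate and 256-sample control rate are illustrative values, not taken from SoundSeed Air or any other product. The only stored “asset” is the `WIND_BANDS` table plus the gust rate, seven numbers in all.

```python
import math
import random

SAMPLE_RATE = 44100

class BandPass:
    """Resonant band-pass biquad (constant 0 dB peak gain, RBJ Audio EQ Cookbook form)."""

    def __init__(self, center_hz, q):
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0
        self.set(center_hz, q)

    def set(self, center_hz, q):
        # Recompute normalized coefficients; the filter state is kept across updates.
        w0 = 2.0 * math.pi * center_hz / SAMPLE_RATE
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        self.b0, self.b2 = alpha / a0, -alpha / a0
        self.a1, self.a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

    def process(self, x):
        y = self.b0 * x + self.b2 * self.x2 - self.a1 * self.y1 - self.a2 * self.y2
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y

# The whole "asset": a few (center frequency, Q) pairs plus a gust rate (illustrative values).
WIND_BANDS = ((400.0, 8.0), (900.0, 12.0), (1800.0, 6.0))
GUST_RATE_HZ = 0.2

def wind(duration_s, bands=WIND_BANDS, gust_rate_hz=GUST_RATE_HZ, seed=0):
    rng = random.Random(seed)
    filters = [BandPass(f, q) for f, q in bands]
    gust_phase = rng.uniform(0.0, 2.0 * math.pi)
    gust = 0.0
    out = []
    for i in range(int(duration_s * SAMPLE_RATE)):
        if i % 256 == 0:  # control rate: update coefficients every 256 samples
            t = i / SAMPLE_RATE
            gust = 0.5 * (1.0 + math.sin(2.0 * math.pi * gust_rate_hz * t + gust_phase))
            for filt, (f, q) in zip(filters, bands):
                # Gusts sweep the resonances upward; a little jitter keeps each gust different.
                filt.set(f * (0.8 + 0.4 * gust) * rng.uniform(0.97, 1.03), q)
        noise = rng.uniform(-1.0, 1.0)  # white noise source
        level = 0.3 + 0.7 * gust  # louder when the gust is strong
        out.append(level * sum(filt.process(noise) for filt in filters) / len(filters))
    return out
```

Changing `seed` or the gust rate gives a different but similar-sounding wind without storing any extra audio.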
The main drawback of procedural audio is of course the CPU resources sometimes required to synthesize the sound. The wind example I mentioned is really cheap; however, other models can be quite expensive, especially if you need several instances of them playing concurrently in a game. You can mitigate that by using procedural audio only for certain sounds: the ones closer to the listener, or the ones needing more variation. In general, the fact that with procedural audio not all voices are created equal (some patches will use more CPU cycles than others) is usually not a very good selling point against the predictability of a fixed voice architecture.
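One way to act on “only the ones closer to the listener” is a per-frame CPU budget. The sketch below is hypothetical: `SoundInstance`, `assign_voices` and the cost units are made up for illustration, and each patch’s cost is assumed to have been profiled offline.

```python
from dataclasses import dataclass

@dataclass
class SoundInstance:
    name: str
    distance: float         # distance to the listener, in metres
    procedural_cost: float  # estimated CPU cost of this instance's patch (assumed profiled offline)

def assign_voices(instances, cpu_budget):
    """Greedy plan: nearest instances get the procedural model while the CPU budget lasts."""
    plan = {}
    remaining = cpu_budget
    for inst in sorted(instances, key=lambda i: i.distance):
        if inst.procedural_cost <= remaining:
            plan[inst.name] = "procedural"
            remaining -= inst.procedural_cost
        else:
            plan[inst.name] = "sample"  # fall back to pre-recorded sample playback
    return plan
```

Because the greedy loop keeps going after an expensive patch is refused, a cheaper but more distant instance can still get the procedural model; a stricter policy could stop at the first refusal.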
Once you leave the academic field and actually start implementing a procedural audio system in a game, you are confronted with a lot of problems like that: optimization of course, but also setting up a tool chain, choosing the best way to update a large number of parameters per frame, interfacing with the other subsystems such as animation, physics, etc. Procedural audio requires greater interaction between sound designers, game designers and programmers. And with QA too… That’s a usually underestimated problem. Be prepared to get a lot of bugs like “one time, it didn’t sound right”. So what was wrong? When you deal with sample playback only, you have a relatively short list of possible bugs and you know the usual suspects for most of them. For example, you know that if you don’t hear a sound, maybe the corresponding bank was not loaded, or the sound was missing from it, or the maximum number of instances allowed was reached, or the volume was incorrect due to the inner/outer radius settings, etc. But with procedural audio, between all the modules and the typically larger number of input parameters involved, there are many more combinations and potential causes for problems. Any module from any patch could be at fault, as well as the coupling between the procedural audio engine and the other game subsystems.
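For contrast, the short list of “usual suspects” for a silent sample-based sound fits in a few lines. This is a hypothetical sketch: the dictionary layout and the linear inner/outer radius roll-off are assumptions, since engines differ. The point is that a procedural system has no equally short checklist; every module and input parameter becomes another branch.

```python
def attenuation(distance, inner_radius, outer_radius):
    """Linear roll-off (one common convention, assumed here): 1 inside inner, 0 beyond outer."""
    if distance <= inner_radius:
        return 1.0
    if distance >= outer_radius:
        return 0.0
    return 1.0 - (distance - inner_radius) / (outer_radius - inner_radius)

def why_is_it_silent(sound, loaded_banks, active_instances, max_instances, distance):
    """Check the usual suspects for a missing sample-based sound, in order."""
    bank = loaded_banks.get(sound["bank"])
    if bank is None:
        return "bank not loaded"
    if sound["name"] not in bank:
        return "sound missing from bank"
    if active_instances.get(sound["name"], 0) >= max_instances:
        return "instance limit reached"
    if attenuation(distance, sound["inner_radius"], sound["outer_radius"]) == 0.0:
        return "attenuated to silence by inner/outer radius"
    return None  # none of the usual suspects: keep digging
```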
But to come back to procedural-audio.com, the website was created in 2007 to serve as a repository of papers, tutorials, demos and links related to procedural audio, and more generally to help share knowledge. Over the years, it has become a reference for students, researchers and industry folks. There is now a Twitter account associated with it (@proceduralaudio) and I encourage your readers to send us tweets about their work, published papers, or even failed experiments so that we can include them on the website. In addition, a few months ago I created the PAIG, or Procedural Audio Interest Group, whose mission is to advocate the use of procedural audio and help PA practitioners share techniques, models, etc. It’s a very informal thing: there is no fee to join, no affiliation with a company or an industry group, and it’s open to everyone. One of the first initiatives is the forum, which recently went online (with the help of Jorge Garcia Martin) and where we have already started some interesting discussions. So if you are into procedural audio, join us!
DS: In the interview with Andy Farnell, we talked about some of the limitations with procedural audio not being technical or creative but more structural and political. What are your experiences and thoughts about this?
NF: While I agree with Andy that there are some structural and political barriers, I also believe there are still an awful lot of technical and creative issues to tackle before you can say that procedural audio is ready to be used outside of academia, on a daily basis in the production cycle of any AAA game (of course it has already been used in some very specific cases). And fixing these issues can help eliminate some of the more structural and political limitations.
One obvious technical issue is that there is no established pipeline that someone wanting to dive into PA could pick up and use right away. By that I mean there is no available combination of a tool to create the models and a run-time engine to play them on the main platforms. There are some very interesting things starting to happen, like libpd for mobile, but nothing for the PS3, Xbox 360 or Wii. Of course there is some proprietary tech in a couple of studios, or ways to interface with Unreal for example, and there are companies who want to sell you predefined models, but that’s all: no well-established middleware where you can create a model easily in a friendly GUI, drop it into your game and have it just work. So we are already missing an essential part here. Having a model running in a tool does not mean that you are ready for production!
On the creative side, I would like to mention something I feel strongly about, especially after working with sound designers in the game industry: the importance of a top-down approach. Currently, tools like Pure Data (which is kind of a de facto standard for experimenting with PA) favour a bottom-up approach to creating models. These are very good tools to prototype and to learn about synthesis and procedural audio. However, in my experience, this bottom-up approach has failed for a number of reasons, and I will repeat here something I wrote recently on the PA forum, if you don’t mind.
First, it is overcomplicated, patches are very hard to read and do take a lot of time to create, even if you can reuse blocks. And on the run-time side of course, using a lot of elementary modules talking to each other is not really efficient. It also requires a lot of knowledge from the sound designer. The fact that most sound designers doing this kind of patching are calling themselves “technical” sound designers speaks for itself. Not only does it require audio synthesis knowledge but it also often requires knowledge about the various audio production mechanisms of the sounds you want to emulate. Even after that, you will need a huge amount of tweaking to get the patch to sound right. All this is quite incompatible with a typical game production (or even pre-production) schedule, especially if you want to use PA for more than one or two types of effects in your game.
More importantly, apart from some specific cases (for example the wind we were talking about earlier), this approach didn’t produce many convincing models. Relatively close, yes. Very satisfying intellectually, yes, because you can understand how the sound is produced, etc. But usable, rarely. You can argue with that one, but my personal experience is that very few models would pass a blind test against recordings. Which is also why game directors are not looking at PA seriously yet. (Of course PA does not always have to be realistic and accurately model physics; you can use it in many other ways.)
Finally, the bottom-up approach does not offer a direct path between the idea the designer has of a sound and how it will actually sound in the game. A designer can recreate pretty much any sound he thinks of by mixing various recordings, synths, etc., but it’s a lot harder to do with bottom-up PA. You often end up with a sound which is basically what you were able to achieve in Pure Data, for example, but not quite exactly what you wanted in your game.
Using a top-down approach, where you create a couple of sounds you like by conventional means and then transform them into a dynamic model using various analysis and modelling techniques, could help create more convincing sound effects, and would remove the need for a new breed of “technical” sound designers. So suddenly you have fewer structural and political hurdles too. Game directors are not stupid: if they can see the value of a technology for their game, they will adopt it. If they haven’t embraced procedural audio after all this time, it is because something is still not at an acceptable level…
DS: A lot of the PA techniques are useful for the mobile app market because of the requirements for maximum flexibility with a minimum memory footprint. Commercial procedural solutions by companies like Audio Gaming seem useful for both linear and nonlinear media. If you have to do some prophesying, where do you see all of this going?
NF: It’s definitely an exciting time for PA. But although there is more talk about it, and the new generation of technical sound designers is more excited about it, there were already people working on such things more than 10 years ago. A bit more recently, around 2003 or 2004, I visited Staccato Systems in San Jose. They had some very competent people on board, coming from CCRMA at Stanford, and were already developing procedural audio models for the game industry. They had excellent footstep models, which probably sounded better than anything I have heard since. They also had car engine models, etc. Their idea, already at the time, was to sell ready-to-use models to game companies or to develop them on demand.
But I guess that business model didn’t work at the time, and that’s where companies specializing in procedural audio today need to think hard: what is your business model? I don’t think selling closed patches with only a few exposed parameters is a viable business model. We sometimes compare sample playback to sprites, and procedural audio to 3D models. To keep that analogy, selling PA models amounts to selling 3D objects. Understandably, no AAA game studio would want to buy that. Just as a 3D artist will want to create his or her own model in Maya, a sound designer will want to craft his or her own sound model, because his or her idea of how the sound should fit in the game is probably different from the idea of a company which has never seen the game.
This might be more acceptable for smaller studios, or indeed for lower-budget games, but you still run the risk of sounding exactly like one of the many other games using the same model. So, again, I think it is better to help sound designers create their own models than to sell them something already done.
In any case, I think it’s important to keep in mind that procedural audio is just an extra tool in your toolbox, not a magic wand that will replace everything. It is pretty clear that with the progress made in acoustic research and with more powerful platforms coming, both consoles and mobile, it is something that will develop more and more, but it will coexist with the other methods of playing sound in a game, not replace them.
DS: What about sounds that are less important in the hierarchy, like environment sounds in an FPS that might not be heard continuously because of a dense soundtrack? It might not be as important for such sounds to be unique.
NF: In that case, it’s more of a balancing act indeed. If the environment sounds really are in the background and everything else is exploding in front of you, there is very little need to craft a complex procedural audio model. Firstly it would not be very discernible, and secondly it would probably eat a lot of CPU cycles for nothing. You could probably obtain a good enough result by layering and randomizing assets, with pitch, volume or filter variations. That’s actually valid for all sound effects: it all depends on the level of detail needed and the number of patch instances you would need to run simultaneously.
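The “good enough” approach for background sounds might look like this minimal sketch, which picks a variation while avoiding immediate repeats and randomizes pitch, volume and a low-pass cutoff. The ranges and the returned parameter names are illustrative assumptions; for layering, call it once per layer.

```python
import random

def randomized_trigger(assets, last_index, rng,
                       pitch_range_semitones=2.0, volume_range_db=3.0,
                       cutoff_range_hz=(4000.0, 12000.0)):
    """Pick an asset (never the same one twice in a row) and randomize its playback."""
    choices = [i for i in range(len(assets)) if i != last_index] or [0]
    index = rng.choice(choices)
    semitones = rng.uniform(-pitch_range_semitones, pitch_range_semitones)
    volume_db = rng.uniform(-volume_range_db, 0.0)
    return {
        "asset": assets[index],
        "index": index,
        "playback_rate": 2.0 ** (semitones / 12.0),  # semitones -> resampling ratio
        "gain": 10.0 ** (volume_db / 20.0),          # dB -> linear gain
        "lowpass_hz": rng.uniform(*cutoff_range_hz),
    }
```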
A common solution is to switch between sample playback and the procedural audio model when it makes sense. Let’s say you have a horde of monsters coming at you. On the tool side, you could use your procedural audio model but record its output with various parameters, and use those recordings as randomized assets. Then, when one monster comes closer to you, switch to the real-time model for that one. Since you also used the procedural audio model to create the recorded versions, you ensure that the experience stays homogeneous.
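The horde example can be sketched as follows. `growl_model` is a hypothetical toy model (a decaying sine with noisy amplitude modulation) standing in for a real patch; the parameter ranges and the real-time radius are assumptions. The same function produces both the pre-rendered bank (tool side) and the live voice (run-time), which is what keeps the result homogeneous.

```python
import math
import random

SAMPLE_RATE = 22050

def growl_model(pitch_hz, roughness, duration_s=0.3, seed=0):
    """Toy procedural 'monster growl' (hypothetical stand-in for a real model)."""
    rng = random.Random(seed)
    out = []
    for i in range(int(duration_s * SAMPLE_RATE)):
        t = i / SAMPLE_RATE
        env = math.exp(-6.0 * t)                # decaying envelope
        mod = 1.0 - roughness * rng.random()    # noisy amplitude modulation
        out.append(env * mod * math.sin(2.0 * math.pi * pitch_hz * t))
    return out

def prerender_bank(count, rng):
    """Tool side: record the model's output with random parameters as a variation bank."""
    bank = []
    for k in range(count):
        params = {"pitch_hz": rng.uniform(70.0, 140.0), "roughness": rng.uniform(0.1, 0.6)}
        bank.append((params, growl_model(seed=k, **params)))
    return bank

def choose_source(monster_distances, realtime_radius):
    """Run-time: the nearest monster inside the radius gets the live model, the rest play the bank."""
    sources = ["prerendered"] * len(monster_distances)
    if monster_distances:
        nearest = min(range(len(monster_distances)), key=lambda i: monster_distances[i])
        if monster_distances[nearest] <= realtime_radius:
            sources[nearest] = "realtime"
    return sources
```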
But since we were talking about the generation of backgrounds or ambiences, there has also been some interesting research in that domain based on audio analysis, and therefore with a kind of top-down approach. A few years back, Perry R. Cook’s team at Princeton developed Tapestrea, a system specialized in the analysis, transformation and generation of environmental audio. Basically, a soundscape can be decomposed into deterministic events, made of highly sinusoidal components; transient events, which are brief non-sinusoidal sounds like footsteps, glass breaking, etc.; and a stochastic background, which is pretty much everything else. Once you have broken a signal down into these three components, you can resynthesize many variations of it, which will be perceptually similar but always different. The deterministic part can be resynthesized using a variation of additive synthesis, and the transient events, usually extracted by filtering during the analysis phase, can be re-injected at whatever time you want. For the stochastic background, which is the hardest part, they were using a wavelet-tree learning algorithm, so this is still very heavy computationally and not yet at a quality level where you could use it in production, but it is nevertheless very interesting.
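A heavily simplified sketch of the resynthesis stage, assuming the analysis has already produced constant-frequency sinusoidal tracks and transient clips. This is not Tapestrea’s code: its deterministic tracks vary over time, and the stochastic background here is a one-pole low-passed noise placeholder, not the wavelet-tree learning the system actually uses.

```python
import math
import random

SAMPLE_RATE = 22050

def resynthesize(tracks, transients, duration_s, background_level, rng):
    """Simplified resynthesis of a decomposed soundscape: deterministic + transient + stochastic."""
    n = int(duration_s * SAMPLE_RATE)
    out = [0.0] * n
    # Deterministic part: additive synthesis, each (freq_hz, amp) track slightly detuned per variation.
    for freq_hz, amp in tracks:
        f = freq_hz * rng.uniform(0.98, 1.02)
        phase = rng.uniform(0.0, 2.0 * math.pi)
        for i in range(n):
            out[i] += amp * math.sin(2.0 * math.pi * f * i / SAMPLE_RATE + phase)
    # Transient events: re-inject each extracted clip at a new random onset.
    for clip in transients:
        if len(clip) > n:
            continue
        start = rng.randrange(0, n - len(clip) + 1)
        for j, s in enumerate(clip):
            out[start + j] += s
    # Stochastic background: placeholder one-pole low-passed white noise.
    y = 0.0
    for i in range(n):
        y += 0.05 * (rng.uniform(-1.0, 1.0) - y)
        out[i] += background_level * y
    return out
```

Each new random generator yields a different but perceptually similar rendition of the same decomposed scene.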
DS: The key to good technological solutions is in making them smart. A procedural tool that allows a sound designer/game developer to be more creative, without them spending time with the actual procedures, would be most useful. How far are we from seeing such tools? You have also previously talked about the audio engine analysing spectral content to make decisions.
NF: Developing a smart procedural audio tool, using a top-down approach, is what I tried to do with Spark at Sony. You can create procedural models very quickly by analysing existing samples, extracting the features of interest, and then finding a way to model them. It’s not only about the timbre of the sound itself, but also about how it gets triggered and how it evolves. So the modules are roughly separated into three main families: audio generators, event generators and update modules. You can, for example, drop a sound in the tool and it will automatically extract the resonating modes. But you could also choose to extract the events, for example from a broken glass sample, by detecting the transients. Once the tool has that list of transients, it uses a “distribution finder” to figure out what kind of distribution they were following and creates a model for that. This is all done automatically: you drop the sample, and you get a distribution model, ready to use, that generates events in the same way as your original sample, but with a bit of variation each time if you want. The same applies to curve models: you drop a sample and decide what you want to extract, for example the pitch contour, the amplitude envelope or, say, the spectral flux. Using linear regression and other methods, the tool will then create a model that can recreate the same type of curve and many variations of it.
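Spark’s internals are not described here, so the following is only an illustration of a “curve model” built by linear regression, assuming the amplitude envelope has already been extracted (for example as one RMS value per frame). Fitting a line to the log amplitude gives an exponential decay; variations jitter the slope and add noise of the size of the fit residuals.

```python
import math
import random

def fit_decay_curve(times, amplitudes):
    """Fit log(amplitude) = intercept + slope * t by ordinary least squares."""
    logs = [math.log(max(a, 1e-9)) for a in amplitudes]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_t
    residuals = [l - (intercept + slope * t) for t, l in zip(times, logs)]
    spread = math.sqrt(sum(r * r for r in residuals) / n)  # RMS of the fit residuals
    return {"intercept": intercept, "slope": slope, "spread": spread}

def generate_curve(model, times, rng, variation=0.1):
    """New curve of the same type: jittered decay rate plus residual-sized noise."""
    slope = model["slope"] * (1.0 + rng.uniform(-variation, variation))
    return [math.exp(model["intercept"] + slope * t + rng.gauss(0.0, model["spread"]))
            for t in times]
```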
Using these objects (a distribution model, a curve model, a resonant body, etc.), the number of modules required to create a complex patch is greatly reduced. Thanks to the analysis you can, for example, simulate that probability distribution with just one module, instead of connecting a dozen of them. But of course it is a bigger investment in resources and R&D to develop this kind of tool than to just use Pure Data and say: here are all the elementary modules you might ever need, go learn about probability distributions, go learn about modal synthesis, and build everything from scratch… One of the goals of smart tools is to put the technology back in the hands of the engineers and leave the creativity to the designer.
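A minimal “distribution finder” in this spirit might fit a few candidate distributions to the inter-onset intervals of detected transients by maximum likelihood and keep the one with the lowest AIC. The candidate set (exponential, normal, uniform) and the AIC criterion are assumptions for this sketch, not a description of Spark; transient detection itself is not shown.

```python
import math
import random

def fit_candidates(intervals):
    """Maximum-likelihood fits: name -> (log-likelihood, parameter count, parameters)."""
    n = len(intervals)
    mean = sum(intervals) / n
    var = sum((x - mean) ** 2 for x in intervals) / n
    lo, hi = min(intervals), max(intervals)
    rate = 1.0 / mean
    fits = {"exponential": (n * math.log(rate) - rate * sum(intervals), 1, {"rate": rate})}
    if var > 0.0:
        ll = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
        fits["normal"] = (ll, 2, {"mean": mean, "std": math.sqrt(var)})
    if hi > lo:
        fits["uniform"] = (-n * math.log(hi - lo), 2, {"low": lo, "high": hi})
    return fits

def find_distribution(onset_times):
    """Pick the candidate with the lowest AIC = 2k - 2 log L."""
    intervals = [b - a for a, b in zip(onset_times, onset_times[1:])]
    fits = fit_candidates(intervals)
    name = min(fits, key=lambda k: 2 * fits[k][1] - 2 * fits[k][0])
    return name, fits[name][2]

def generate_onsets(name, params, count, rng):
    """One 'event generator' module: new onsets that follow the fitted distribution."""
    t, onsets = 0.0, []
    for _ in range(count):
        if name == "exponential":
            t += rng.expovariate(params["rate"])
        elif name == "normal":
            t += max(0.0, rng.gauss(params["mean"], params["std"]))
        else:
            t += rng.uniform(params["low"], params["high"])
        onsets.append(t)
    return onsets
```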
Also, I just wanted to clarify something that Andy Farnell said in his interview: it was mentioned that Spark was centred on a phase vocoder or some other fixed analysis/resynthesis method. That is absolutely not the case. Although there is of course spectral analysis involved, and we are certainly using it, for example, to get the modes of resonant impact sounds, Spark relies on another system I developed called AFEX (for Audio Features EXtraction). This system includes a lot of DSP and analysis functions and also comes with about 30 feature extraction plug-ins. So really, depending on the type of sound object you want to model, we could use a plethora of methods, and not only for the timbre, as I explained earlier, but for all kinds of characteristics. Also, the interview stated that Sony would rather sell you predefined models than let you create them: it is actually quite the opposite. First, because it was an internal project, so there was absolutely zero motivation to do that. But more importantly, because the whole philosophy of this project was about empowering the sound designers to create their own models semi-automatically, from their own existing sounds.
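Once the modes of a resonant impact sound have been found (the spectral analysis step is not shown), a “resonant body” module can be as simple as a bank of exponentially decaying sinusoids. In this sketch the mode values, detune amount and amplitude jitter are illustrative assumptions; each call gives a slightly different hit.

```python
import math
import random

SAMPLE_RATE = 22050

def modal_impact(modes, duration_s, rng, detune=0.01, strike=1.0):
    """Modal synthesis: one decaying sinusoid per (frequency_hz, decay_per_s, amplitude) mode."""
    n = int(duration_s * SAMPLE_RATE)
    out = [0.0] * n
    for freq_hz, decay, amp in modes:
        f = freq_hz * (1.0 + rng.uniform(-detune, detune))  # small per-hit detuning
        a = amp * strike * rng.uniform(0.8, 1.0)            # small per-hit level change
        for i in range(n):
            t = i / SAMPLE_RATE
            out[i] += a * math.exp(-decay * t) * math.sin(2.0 * math.pi * f * t)
    return out
```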
The spectrally-informed audio engine you mentioned is a bit different. It’s an example of an idea where you could use audio analysis to make your run-time smarter. The spectral content of the assets can be analysed, either on the tool side for samples or at run-time for procedurally generated assets. The engine maintains a matrix that describes which frequencies are played, on which channel and with what amplitude. It can then make informed decisions based on that. Applications of such a system range from perceptual voice management to audio shaders (effects based on the frequency content of the assets) and dynamic mixing.
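To illustrate perceptual voice management on top of such a matrix, here is a hypothetical sketch with a deliberately crude masking proxy: a voice’s score is its largest share of any band on its channel, and the least audible voices are culled first. A real engine would use a proper psychoacoustic model; the band layout and voice format are assumptions.

```python
def band_matrix(voices, num_bands):
    """matrix[channel][band] = summed amplitude of every voice playing on that channel."""
    matrix = {}
    for v in voices:
        row = matrix.setdefault(v["channel"], [0.0] * num_bands)
        for band, amp in enumerate(v["bands"]):
            row[band] += amp
    return matrix

def audibility(voice, matrix):
    """Crude masking proxy: the voice's largest share of any band on its channel."""
    row = matrix[voice["channel"]]
    shares = [amp / row[band] for band, amp in enumerate(voice["bands"]) if amp > 0.0]
    return max(shares, default=0.0)

def cull_masked(voices, num_bands, max_voices):
    """Keep the max_voices voices that are least masked by the others."""
    matrix = band_matrix(voices, num_bands)
    ranked = sorted(voices, key=lambda v: audibility(v, matrix), reverse=True)
    return [v["name"] for v in ranked[:max_voices]]
```

A quiet hum sitting under a loud rumble in the same bands is the first to go, while a hiss that owns its own bands survives.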
But with all these smart tools or engines, obviously you don’t want the computer to make all the decisions for you. In the run-time you must be able to set boundaries, and on the tool side you must keep a “power user” mode, in which all parameters are still available and you can use and even abuse them. For example in the Spark tool I mentioned, although you can just drop a sample and automatically create a distribution model from it without any interaction at all, you can also hold the “Alt” key while doing so and in that case you will get access to all the analysis parameters, and therefore tweak the way the model is built.
DS: You are now CEO and founder of tsugi. Tell us about it. What else do you have planned for the future?
NF: “tsugi” means “next” in Japanese. So what we are doing is imagining and designing the next generation of tools and technologies for our clients, which include game studios, console manufacturers, middleware companies, post-production houses, and professional artists of all disciplines. This is not a company specialized in procedural audio (or even audio actually) although this is obviously something we are good at and we leverage that kind of knowledge when it does make sense.
Basically we look at the way artists work and we create smart tools, which will both help them to be more productive and offer them more creative options. These smart tools understand the data they manipulate and therefore are able to visualize it in the most appropriate way or to offer the most pertinent processing options to the user. To achieve this, we use a combination of digital signal processing, artificial intelligence, procedural techniques and innovative interfaces. So, if you are looking for next-next-gen tools and technologies, if you are confronted with a seemingly impossible game audio problem or just need workflow improvements for your creative team, feel free to contact us, there is a good chance we can help you.
As for the future, you will see some of our internal software become available on our website (www.tsugi-studio.com). We will soon have tsugi QuickAudio, a Shell Extension that creates thumbnails for audio files in Windows Explorer. Just as you can see your pictures in Explorer, you will be able to see the waveforms of your .wav, .aif, .mp3 or .ogg files, as well as their spectra. So it’s easy to instantly know which ones are loud, which ones have a slow attack, which ones are high-pitched, etc. Because it’s a Shell Extension, you can also see these thumbnails when selecting files in most of your audio applications, from Sound Forge to Wwise, for example. In addition, you can listen to the files just by pressing a key, without having to open Windows Media Player, and you can quickly edit them without launching a DAW. So it really improves the workflow of anybody dealing with audio files. It’s pretty addictive, actually: once you have used it, it’s hard to go back to a machine on which it’s not installed! And for the “game audio” folks out there, you can select a group of samples in Explorer and automatically create a Wwise work unit or an FMOD project from them.
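QuickAudio’s implementation is not described in the interview; the core of any waveform thumbnail, though, is reducing the file to one (min, max) pair per pixel column. This sketch uses Python’s standard `wave` and `struct` modules, assumes 16-bit PCM WAV, draws only the first channel, and leaves spectrum thumbnails and other formats out.

```python
import struct
import wave

def waveform_thumbnail(path, width=200):
    """Return one (min, max) pair per pixel column, normalized to [-1, 1].

    Only 16-bit PCM WAV is handled, and only the first channel is drawn.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())
    # Interleaved little-endian int16 samples; keep every `channels`-th one (channel 0).
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)[::channels]
    n = len(samples)
    columns = min(width, n)
    thumb = []
    for c in range(columns):
        chunk = samples[c * n // columns:(c + 1) * n // columns]
        thumb.append((min(chunk) / 32768.0, max(chunk) / 32768.0))
    return thumb
```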
Finally, as we are located in Japan, we also provide support and localization services in Japanese for western companies. There should be a couple of major announcements in that domain on our web site soon.