
Nvidia debuts AI model that can create music, mimic speech

Nvidia (NVDA) has developed a new kind of artificial intelligence model that can create sound effects, change the way a person sounds, and generate music using natural language prompts. Called Fugatto, or Foundational Generative Audio Transformer Opus 1, the model is a research project. Nvidia says it’s not announcing any plans to release the technology, but it could have broad implications for industries ranging from music and entertainment to translation services.

“The thing that’s so exciting about [Fugatto] is that having a model that you can prompt to ask it to make sounds in certain ways really opens up the landscape of things that you can imagine doing with it,” Bryan Catanzaro, vice president of applied deep learning research at Nvidia, told Yahoo Finance.

What sets Fugatto apart from other models, Catanzaro explained, is that it can perform the tasks of several other models. For instance, there are models that can synthesize speech and others that can add sound effects to music; Fugatto, however, does it all. Think of it as a kind of complement to video- and image-generating models like Stability AI’s Stable Video Diffusion or OpenAI’s Sora.

“The foundational improvement here is that … we’re able to synthesize audio using language, and that, I think, opens up new prospects for tools that people can use to create amazing audio,” Catanzaro added.

According to Nvidia, Fugatto is the first foundational model with emergent properties, which means it’s able to mix the elements it’s been trained on and follow “free-form instructions.”

Nvidia CEO Jensen Huang before a baseball game between the San Francisco Giants and the Arizona Diamondbacks in San Francisco, on Sept. 3, 2024. (AP Photo/Jeff Chiu) · ASSOCIATED PRESS

The model can generate audio via standard word prompts as well as manipulate audio files that you upload. So if you have a file of a person speaking, you could translate that person’s words to another language while still making it sound like their voice. You could also take a simple tune and make it sound like an orchestral performance or add different beats to music.

You can also upload a document and have the model read it in any voice you’d like. What’s more, you can tell the model to produce voices that carry emotional weight. Want audio of a dejected English teacher reading Edgar Allan Poe? Fugatto should be able to do it.

Catanzaro, however, warns that the model isn’t always perfect and that some results are better than others.

Like generative image and video models, Fugatto raises questions about the potential impact on artists, sound engineers, and people in related fields. Catanzaro, though, says he hopes the technology helps musicians.

“I hope what it means is new tools for artists to explore,” he explained. “I think audio has always been a fruitful place for exploration. You know, when we get new tools for audio, sometimes we get new forms of music.”

