May 24, 2018

Facebook researchers use AI to turn whistles into orchestral music, and power other musical “translations”

By: Noam Mor, Lior Wolf, Adam Polyak, Yaniv Taigman

Researchers from Facebook Artificial Intelligence Research (FAIR) have developed an AI system that can translate music, accepting a broad range of audio input (everything from multi-instrument symphonies to simple whistles) and outputting various kinds of music. It’s the first known use of AI to create high-fidelity music by translating automatically between different instruments, as well as between different styles and genres of music. Beyond being an important step for research into AI that requires minimal training (the team’s auto-encoder can convert unfamiliar music without being prepped or supervised), this work also points to the possibility of AI-powered music creation, with full songs generated from little more than a hummed tune.


FAIR’s universal music translation system is part of a larger exploration by the AI community into unsupervised translation. Typical translation systems learn by example, with AI that’s trained on matching pairs of images or text, building a sense of what makes a given piece of data similar to another. Researchers call this supervised learning, and though it’s the most common way to train AI, it’s also time- and labor-intensive, and can lead to systems that can’t adapt in the moment.

Strategic confusion

FAIR’s method still requires training to create different kinds of musical output, such as piano in the style of Beethoven or the choral vocals of a cantata. But to allow the system to transform music in an unsupervised (you might even say improvisational) way, the team intentionally distorted the musical input and added something called a domain confusion network, which prevents the AI from encoding domain-specific information. In other words, the system is forced to ignore the unique aspects of a recorded song’s style, genre and instruments, and create translations based on the core structure of the music.
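
To make the idea of domain confusion a little more concrete, here is a minimal sketch of one common adversarial formulation. It is an illustration under stated assumptions, not the code from the paper: a classifier tries to guess the musical domain from the encoder's output, and the encoder is rewarded when that classifier is confused. The function names and the `confusion_weight` value are hypothetical.

```python
import math


def cross_entropy(logits, true_index):
    """Softmax cross-entropy of a domain classifier's guess (log-sum-exp form)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[true_index]


def encoder_objective(reconstruction_loss, domain_logits, true_domain,
                      confusion_weight=0.01):
    """Hypothetical encoder loss: reconstruct well, but keep the classifier guessing.

    confusion_weight is an arbitrary illustrative value, not taken from the source.
    Subtracting the classifier's loss means the encoder's objective improves
    as the classifier gets worse at identifying the domain.
    """
    classifier_loss = cross_entropy(domain_logits, true_domain)
    return reconstruction_loss - confusion_weight * classifier_loss


# Six domains, as in the training setup described in the article.
confused = [0.0] * 6                          # classifier has no idea
confident = [10.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # classifier is sure it is domain 0

loss_when_confused = encoder_objective(1.0, confused, true_domain=0)
loss_when_confident = encoder_objective(1.0, confident, true_domain=0)
```

With the same reconstruction quality, the encoder's objective is lower when the classifier cannot tell the domains apart, which is the pressure that pushes style, genre and instrument details out of the encoding.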

Hearing is believing

A new paper from FAIR, “A Universal Music Translation Network,” fully details the system’s single-encoder, WaveNet-based architecture, including the novel approach of distorting musical input by shifting it slightly out of tune, and the use of eight Tesla V100 GPUs for six days of training on six different musical domains. The paper also includes evaluation scores that indicate what appears to be unprecedented: the system is as good or only slightly worse at converting one musical instrument into another, with many human evaluators unable to tell which file was the original input and which was the AI-generated output. But the bigger impact might come from the sample audio comparisons, which you can hear for yourself. Some translations are rougher than others, but the system delivers a number of realistic conversions.
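
As a rough illustration of what "shifting it slightly out of tune" can mean in practice, the sketch below detunes a random stretch of a signal by resampling it. This is an assumption-laden toy, not the paper's implementation: the function names, the half-semitone limit and the segment length are all hypothetical, and naive resampling changes the segment's duration along with its pitch.

```python
import math
import random


def pitch_shift_by_resampling(samples, semitones):
    """Shift pitch by reading the signal faster or slower (linear interpolation).

    Naive approach: the output length changes along with the pitch.
    """
    ratio = 2.0 ** (semitones / 12.0)  # frequency ratio for the requested shift
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def detune_random_segment(samples, max_semitones=0.5, rng=random):
    """Detune one randomly chosen quarter of the signal by a small random amount.

    max_semitones and the segment length are illustrative choices only.
    """
    start = rng.randrange(0, len(samples) // 2)
    end = start + len(samples) // 4
    shift = rng.uniform(-max_semitones, max_semitones)
    shifted = pitch_shift_by_resampling(samples[start:end], shift)
    return samples[:start] + shifted + samples[end:]


# A 440 Hz test tone, 0.1 seconds at a 16 kHz sample rate.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
distorted = detune_random_segment(tone, rng=random.Random(0))
```

Because the distortion is small and random, the training input no longer matches the original recording exactly, which makes it harder for a model to simply memorize what it hears.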

What’s next

Facebook has no plans for a specific product or feature based on this work, but FAIR’s research is a strong indicator of how AI could soon power human creativity. From composing whole symphonies with your voice to transforming a simple guitar lick or MIDI tune into layered vocals, this approach could democratize songwriting, and make music production more accessible.