SING: Symbol-to-Instrument Neural Generator

SING is a deep-learning-based musical note synthesizer that can be trained on the NSynth dataset. Despite being 32 times faster to train and 2,500 times faster at inference, SING produces audio with significantly improved perceptual quality compared to the NSynth WaveNet-like autoencoder [1], as measured by Mean Opinion Scores from human evaluations. The architecture and results are detailed in our paper, SING: Symbol-to-Instrument Neural Generator. The source code is available in our SING GitHub repository. SING combines an LSTM-based sequence generator with a convolutional decoder:

Schema representing the structure of SING: an LSTM is followed by a convolutional decoder.
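As a rough illustration of this structure (a hedged sketch, not the implementation from our repository: every dimension, layer count, and embedding size below is hypothetical), a sequence generator conditioned on instrument, pitch, and velocity embeddings can feed a convolutional decoder that upsamples frames to a waveform:

```python
import torch
import torch.nn as nn

class SingSketch(nn.Module):
    """Toy sketch of an LSTM sequence generator + convolutional decoder.
    All hyperparameters here are illustrative, not the paper's values."""

    def __init__(self, embed_dim=32, hidden=256, frame_dim=128, stride=256):
        super().__init__()
        # One embedding table per categorical conditioning input.
        self.instrument = nn.Embedding(1006, embed_dim)
        self.pitch = nn.Embedding(128, embed_dim)
        self.velocity = nn.Embedding(128, embed_dim)
        self.lstm = nn.LSTM(3 * embed_dim, hidden, num_layers=3, batch_first=True)
        self.to_frame = nn.Linear(hidden, frame_dim)
        # Convolutional decoder: turn the frame sequence into audio samples.
        self.decoder = nn.Sequential(
            nn.Conv1d(frame_dim, frame_dim, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.ConvTranspose1d(frame_dim, 1, kernel_size=stride, stride=stride),
        )

    def forward(self, instrument, pitch, velocity, n_frames):
        emb = torch.cat([self.instrument(instrument),
                         self.pitch(pitch),
                         self.velocity(velocity)], dim=-1)
        # Repeat the conditioning vector at every time step of the sequence.
        seq = emb.unsqueeze(1).expand(-1, n_frames, -1)
        frames, _ = self.lstm(seq)
        frames = self.to_frame(frames).transpose(1, 2)  # (batch, frame_dim, T)
        return self.decoder(frames).squeeze(1)          # (batch, T * stride)

model = SingSketch()
# One note: instrument 3, MIDI pitch 60, velocity 100, 10 frames of 256 samples.
wav = model(torch.tensor([3]), torch.tensor([60]), torch.tensor([100]), n_frames=10)
print(wav.shape)
```

The key design point is that the expensive autoregressive sample-by-sample loop of WaveNet is replaced by an LSTM running at the frame rate, with a strided transposed convolution producing all audio samples of a frame at once.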

Audio samples

Below is a comparison of notes generated by SING and by the NSynth WaveNet-like autoencoder [1]. We sampled a random subset of 100 notes from the test set and used it as the human evaluation set for the Mean Opinion Score measurements. Here we present a smaller subset: for each instrument family in the evaluation set, we sampled one member at random. We also show the rainbowgram for each audio file. A rainbowgram is defined in [1] as "a CQT spectrogram with intensity of lines proportional to the log magnitude of the power spectrum and color given by the derivative of the phase". The names of the models are given after the table.
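To make the rainbowgram definition concrete, here is a minimal sketch of the two quantities it combines. Note the assumptions: [1] uses a CQT, but this sketch substitutes a plain STFT so it needs only NumPy, and the FFT and hop sizes are arbitrary illustrative choices:

```python
import numpy as np

def rainbowgram(signal, n_fft=512, hop=128):
    """Approximate rainbowgram: log magnitude gives the intensity of the
    lines, and the time derivative of the (unwrapped) phase gives the color.
    An STFT stands in for the CQT used in [1]."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] for i in range(n_frames)])
    spec = np.fft.rfft(frames * np.hanning(n_fft), axis=1).T  # (freq, time)
    log_mag = np.log1p(np.abs(spec))                 # intensity channel
    phase = np.unwrap(np.angle(spec), axis=1)
    dphase = np.diff(phase, axis=1, prepend=phase[:, :1])  # color channel
    return log_mag, dphase

# Example: one second of a 440 Hz sine at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
mag, dph = rainbowgram(np.sin(2 * np.pi * 440 * t))
print(mag.shape, dph.shape)
```

Plotting `mag` as brightness and `dph` as hue reproduces the rainbow-colored look that gives the visualization its name.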

We also provide samples generated by SING trained with the waveform MSE, as well as trained without time embeddings. They are presented in the second table below.

Comparison of NSynth wavenet based decoder and SING

Instrument and note ID (audio: ground truth, Model A, Model B):

Bass synthetic (128-097-100)
Brass acoustic (001-054-127)
Flute acoustic (027-077-050)
Guitar acoustic (009-069-100)
Keyboard electronic (092-090-127)
Mallet acoustic (002-060-127)
Organ electronic (077-047-100)
Reed acoustic (033-044-050)
String acoustic (007-060-025)
Synth lead synthetic (006-045-050)
Vocal acoustic (023-057-127)

Model A is SING and Model B is the NSynth WaveNet autoencoder.

Comparison of SING trained with the spectral loss, with the waveform loss, and without time embeddings

Instrument and note ID (audio: ground truth, SING with spectral loss, SING with waveform loss, SING without time embedding, NSynth WaveNet):

Bass synthetic (128-097-100)
Brass acoustic (001-054-127)
Flute acoustic (027-077-050)
Guitar acoustic (009-069-100)
Keyboard electronic (092-090-127)
Mallet acoustic (002-060-127)
Organ electronic (077-047-100)
Reed acoustic (033-044-050)
String acoustic (007-060-025)
Synth lead synthetic (006-045-050)
Vocal acoustic (023-057-127)

Bibliography

[1]: Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders." 2017.