Fitting New Speakers Based on a Short Untranscribed Sample

International Conference on Machine Learning (ICML)


Learning-based text-to-speech (TTS) systems have the potential to generalize from one speaker to the next and should therefore require only a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that, given an audio sample, places the speaker in the embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate greatly improved performance both on the dataset speakers and, more importantly, when fitting new voices, even from very short samples.
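The core idea above can be illustrated with a minimal sketch. The encoder below is a hypothetical stand-in (not the paper's architecture): it mean-pools frame-level audio features over time and projects them linearly into a speaker-embedding space, and the loss shown is one plausible form of embedding consistency, namely the mean squared error between the embedding recovered from an audio sample and the embedding used during synthesis. All sizes and names here are illustrative assumptions.

```python
import numpy as np

def speaker_encoder(features, W, b):
    """Hypothetical speaker encoder: mean-pool frame-level audio
    features over the time axis, then project linearly into the
    speaker-embedding space."""
    pooled = features.mean(axis=0)   # (feat_dim,) time-averaged features
    return W @ pooled + b            # (embed_dim,) speaker embedding

def embedding_consistency_loss(pred_embedding, target_embedding):
    """One possible consistency loss: MSE between the embedding
    predicted from audio and the embedding used for synthesis."""
    return float(np.mean((pred_embedding - target_embedding) ** 2))

# Toy example (sizes are illustrative, not from the paper).
rng = np.random.default_rng(0)
T, feat_dim, embed_dim = 50, 40, 16
W = rng.standard_normal((embed_dim, feat_dim)) * 0.1
b = np.zeros(embed_dim)

features = rng.standard_normal((T, feat_dim))  # stand-in for e.g. mel frames
target = rng.standard_normal(embed_dim)        # embedding used for synthesis

e = speaker_encoder(features, W, b)
loss = embedding_consistency_loss(e, target)
print(e.shape, loss)
```

In the actual system such an encoder would be trained jointly with the synthesis network, so that synthesizing from an embedding and re-encoding the resulting audio drives the consistency loss toward zero.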

