Publication

Fitting New Speakers Based on a Short Untranscribed Sample

International Conference on Machine Learning (ICML)


Abstract

Learning-based Text To Speech systems have the potential to generalize from one speaker to the next and thus require a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method that is designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that given an audio sample, places the speaker in the embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate a greatly improved performance on both the dataset speakers, and, more importantly, when fitting new voices, even from very short samples.

Related Publications

All Publications

EMNLP - October 31, 2021

Evaluation Paradigms in Question Answering

Pedro Rodriguez, Jordan Boyd-Graber

ASRU - December 13, 2021

Incorporating Real-world Noisy Speech in Neural-network-based Speech Enhancement Systems

Yangyang Xia, Buye Xu, Anurag Kumar

NAACL - June 6, 2021

Leveraging Slot Descriptions for Zero-Shot Cross-Domain Dialogue State Tracking

Zhaojiang Lin, Bing Liu, Seungwhan Moon, Paul Crook, Zhenpeng Zhou, Zhiguang Wang, Zhou Yu, Andrea Madotto, Eunjoon Cho, Rajen Subba

IROS - September 1, 2021

Success Weighted by Completion Time: A Dynamics-Aware Evaluation Criteria for Embodied Navigation

Naoki Yokoyama, Sehoon Ha, Dhruv Batra

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy