Publication

TTS Skins: Speaker Conversion via ASR

Interspeech


Abstract

We present a fully convolutional wav-to-wav network for converting between speakers’ voices, without relying on text. Our network is based on an encoder-decoder architecture, where the encoder is pre-trained for the task of Automatic Speech Recognition, and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on narrated audiobooks, and demonstrate multi-voice TTS in those voices, by converting the voice of a TTS robot.

Related Publications

All Publications

EMNLP - October 31, 2021

Evaluation Paradigms in Question Answering

Pedro Rodriguez, Jordan Boyd-Graber

ASRU - December 13, 2021

Incorporating Real-world Noisy Speech in Neural-network-based Speech Enhancement Systems

Yangyang Xia, Buye Xu, Anurag Kumar

NAACL - June 6, 2021

Leveraging Slot Descriptions for Zero-Shot Cross-Domain Dialogue State Tracking

Zhaojiang Lin, Bing Liu, Seungwhan Moon, Paul Crook, Zhenpeng Zhou, Zhiguang Wang, Zhou Yu, Andrea Madotto, Eunjoon Cho, Rajen Subba

Uncertainty and Robustness in Deep Learning Workshop at ICML - June 24, 2021

DAIR: Data Augmented Invariant Regularization

Tianjian Huang, Chinnadhurai Sankar, Pooyan Amini, Satwik Kottur, Alborz Geramifard, Meisam Razaviyayn, Ahmad Beirami

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy