Transformer-based Acoustic Modeling for Streaming Speech Synthesis

Interspeech

Abstract

Transformer models have shown promising results in neural speech synthesis due to their superior ability to model long-term dependencies compared with recurrent networks. However, the computational complexity of transformers increases quadratically with sequence length, making them impractical for many real-time applications. To address this complexity issue in the speech synthesis domain, this paper proposes an efficient transformer-based acoustic model whose speed is constant regardless of input sequence length, making it well suited to streaming speech synthesis. The proposed model uses a transformer network to predict prosody features at the phone rate and an Emformer network to predict frame-rate spectral features in a streaming manner. Both the transformer and the Emformer use a self-attention mechanism that incorporates explicit long-term information, improving speech naturalness for long utterances. In our experiments, a WaveRNN neural vocoder takes the predicted spectral features and generates the final audio. The overall architecture achieves human-like speech quality on both short and long utterances while maintaining low latency and a low real-time factor. Our mean opinion score (MOS) evaluation shows that for short utterances the proposed model achieves a MOS of 4.213, compared with 4.307 for the ground truth; for long utterances it also produces high-quality speech, with a MOS of 4.201 compared with 4.360 for the ground truth.
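To make the two-stage structure concrete, the sketch below shows one way such an acoustic model could be organized: a non-streaming transformer that predicts prosody at the phone rate, followed by a chunk-wise encoder that predicts frame-rate spectral features with a bounded left context so per-chunk cost stays constant. This is a minimal illustration in vanilla PyTorch, not the authors' implementation; all module names, dimensions, and the simplified chunked attention loop standing in for the Emformer are assumptions.

```python
# Illustrative sketch of a two-stage streaming acoustic model (assumed design,
# not the paper's code). The Emformer is approximated by a chunked encoder
# that keeps only a fixed-size left-context cache between chunks.
import torch
import torch.nn as nn


class PhoneRateProsodyPredictor(nn.Module):
    """Non-streaming transformer predicting prosody features at the phone rate."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4, n_prosody=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, n_prosody)  # e.g. duration, F0, energy per phone

    def forward(self, phone_embeddings):           # (batch, n_phones, d_model)
        return self.proj(self.encoder(phone_embeddings))


class StreamingSpectralPredictor(nn.Module):
    """Simplified stand-in for the Emformer: processes frames chunk by chunk,
    attending only to a bounded left context so cost per chunk is constant."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4, n_mels=80,
                 chunk_size=32, left_context=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj = nn.Linear(d_model, n_mels)
        self.chunk_size, self.left_context = chunk_size, left_context

    def forward(self, frame_inputs):               # (batch, n_frames, d_model)
        outputs = []
        cache = frame_inputs.new_zeros(frame_inputs.size(0), 0, frame_inputs.size(2))
        for start in range(0, frame_inputs.size(1), self.chunk_size):
            chunk = frame_inputs[:, start:start + self.chunk_size]
            context = torch.cat([cache, chunk], dim=1)   # bounded left context + chunk
            encoded = self.encoder(context)[:, -chunk.size(1):]
            outputs.append(self.proj(encoded))
            cache = context[:, -self.left_context:]      # keep only recent history
        return torch.cat(outputs, dim=1)                 # (batch, n_frames, n_mels)


if __name__ == "__main__":
    phones = torch.randn(1, 20, 256)                 # 20 phone embeddings
    prosody = PhoneRateProsodyPredictor()(phones)    # (1, 20, 3)
    frames = torch.randn(1, 200, 256)                # upsampled frame-rate inputs
    mels = StreamingSpectralPredictor()(frames)      # (1, 200, 80), produced chunk-wise
    print(prosody.shape, mels.shape)
```

In this sketch the predicted mel features would then be passed to a neural vocoder such as WaveRNN to generate the waveform; because each chunk only attends to a fixed-size history, latency and compute per chunk do not grow with utterance length.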
