JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech


Abstract

In neural text-to-speech (TTS), two-stage systems, or cascades of separately learned models, have shown synthesis quality close to human speech. For example, FastSpeech2 transforms an input text into a mel-spectrogram, and HiFi-GAN then generates a raw waveform from that mel-spectrogram; the two are called an acoustic feature generator and a neural vocoder, respectively. However, their training pipeline is somewhat cumbersome in that it requires fine-tuning and an accurate speech-text alignment for optimal performance. In this work, we present an end-to-end text-to-speech (E2E-TTS) model that has a simplified training pipeline and outperforms a cascade of separately learned models. Specifically, our proposed model consists of FastSpeech2 and HiFi-GAN jointly trained with an alignment module. Since there is no acoustic feature mismatch between training and inference, it does not require fine-tuning. Furthermore, we remove the dependency on an external speech-text alignment tool by adopting an alignment learning objective in our joint training framework. Experiments on the LJSpeech corpus show that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS on subjective evaluation (MOS) and on some objective evaluations.
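To illustrate why speech-text alignment matters here: FastSpeech2-style models expand token-level hidden states to frame rate using a per-token duration, and in a cascaded setup those durations typically come from an external aligner. The following is a minimal sketch of that expansion step only, not code from the paper; the function name `length_regulate` and the toy values are assumptions made for illustration.

```python
from typing import List, Sequence


def length_regulate(
    token_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each token-level state by its duration, measured in frames.

    Hypothetical helper for illustration: durations are assumed to be
    non-negative integers, one per token, produced by some alignment.
    """
    if len(token_states) != len(durations):
        raise ValueError("need exactly one duration per token")
    frames: List[List[float]] = []
    for state, duration in zip(token_states, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the state once per frame so frames do not share a list.
        frames.extend(list(state) for _ in range(duration))
    return frames


# Toy input: 3 tokens with 2-dimensional states and made-up durations.
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 1, 3]
frames = length_regulate(states, durations)
print(len(frames))  # number of frames equals sum(durations)
```

If the durations are inaccurate, the frame-level sequence no longer lines up with the target speech, which is one reason an alignment learned jointly with the rest of the model can simplify the pipeline.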

Author and article information

Publication date Created: 31 March 2022

Article

arXiv ID: 2203.16852

SO-VID: 2e2266c1-782e-4de0-b845-aac9d0041b34

License: http://creativecommons.org/licenses/by/4.0/

Custom metadata

Comments: Submitted to INTERSPEECH 2022

Categories: eess.AS, cs.LG, cs.SD

ScienceOpen disciplines: Artificial intelligence, Graphics & Multimedia design, Electrical engineering

