A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis

Wang, Xin; Takaki, Shinji; Yamagishi, Junichi; King, Simon; Tokuda, Keiichi


oai:pure.ed.ac.uk:publications/fa6bb2ca-52e0-41fc-b42e-b3aa3b2700f9

Publication date: 1 January 2020

Abstract

Recurrent neural networks (RNNs) can predict fundamental frequency (F0) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive F0 values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural F0 models to capture the causal dependency of successive F0 values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN. Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) neural F0 model that is both more efficient and more interpretable than the DAR. This model has two stages: one uses the VQ-VAE framework to learn a latent code for the F0 contour of each linguistic unit, and the other learns to map from linguistic features to latent codes. In contrast to the DAR and RNN, which process the input linguistic features frame by frame, the new model converts one linguistic feature vector into one latent code for each linguistic unit. The new model achieves better objective scores than the DAR, has a smaller memory footprint, and is computationally faster. Visualization of the latent codes for phones and moras reveals that each latent code represents an F0 shape for a linguistic unit.
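The quantization step at the core of any VQ-VAE can be illustrated with a minimal sketch: an encoder output for one linguistic unit is replaced by the index of its nearest codebook vector. The function names, the toy codebook, and the example encodings below are all assumptions made for illustration. The paper's actual encoder, codebook size, and training procedure are not described on this page and are not reproduced here.

```python
# Hypothetical sketch of the nearest-neighbour codebook lookup used in the
# quantization step of a VQ-VAE. All names and values are illustrative
# assumptions, not taken from the paper.
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    encoding: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return the index and the vector of the codebook entry closest to `encoding`."""
    index = min(
        range(len(codebook)),
        key=lambda k: squared_distance(encoding, codebook[k]),
    )
    return index, list(codebook[index])


# Toy codebook with three 2-dimensional code vectors (assumed values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]

# One assumed encoder output per linguistic unit (e.g. per phone or mora).
unit_encodings = [[0.9, 1.2], [-0.8, 0.4], [0.1, -0.1]]

# Each unit is summarized by a single discrete latent code.
codes = [quantize(e, codebook)[0] for e in unit_encodings]
print(codes)  # prints [1, 2, 0]
```

In this sketch each linguistic unit yields exactly one discrete code, which mirrors the abstract's description of one latent code per unit rather than one prediction per frame.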



Last time updated on 27/01/2020

This paper was published in Edinburgh Research Explorer.
