Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

Ji, An; Johnson, Michael T.; Berry, Jeffrey J.

Repository landing page

research

oai:epublications.marquette.edu:spaud_fac-1038

Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

Authors: An Ji
Michael T. Johnson
Jeffrey J. Berry
Publication date: 1 October 2016
Publisher: e-Publications@Marquette

Abstract

Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches for inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and uses speaker-adapted articulatory models derived from acoustically derived weights. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset consisting of speakers with strong individual speaker-dependent inversion performance, the PRSW method is able to attain kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data

Similar works

Full text

Open in the Core reader

Download PDF

epublications@Marquette

oai:epublications.marquette.ed...

Last time updated on 09/07/2019

This paper was published in epublications@Marquette.

Having an issue?

Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.