Artificial Vocal Learning guided by Phoneme Recognition and Visual Information


Krug, Paul Konstantin; Birkholz, Peter; Gerazov, Branislav; Van Niekerk, Daniel Rudolph; Xu, Anqi; Xu, Yi; (2023) Artificial Vocal Learning guided by Phoneme Recognition and Visual Information. IEEE/ACM Transactions on Audio, Speech, and Language Processing 10.1109/taslp.2023.3264454. (In press). Green open access


Abstract

This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. Thereby, a novel approach for artificial vocal learning is presented that utilizes deep neural network-based phoneme recognition in order to calculate the speech acquisition objective function. This function guides a learning framework that involves the state-of-the-art articulatory speech synthesizer VocalTractLab as the motor-to-acoustic forward model. In this way, an extensive set of German phonemes, including most of the consonants and all stressed vowels, was produced successfully. The synthetic phonemes were rated as highly intelligible by human listeners. Furthermore, it is shown that visual speech information, such as lip and jaw movements, can be extracted from video recordings and be incorporated into the learning framework as an additional loss component during the optimization process. It was observed that this visual loss did not increase the overall intelligibility of phonemes. Instead, the visual loss acted as a regularization mechanism that facilitated the finding of more biologically plausible solutions in the articulatory domain.

Type:	Article
Title:	Artificial Vocal Learning guided by Phoneme Recognition and Visual Information
Open access status:	An open access version is available from UCL Discovery
DOI:	10.1109/taslp.2023.3264454
Publisher version:	https://doi.org/10.1109/taslp.2023.3264454
Language:	English
Additional information:	This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords:	Vocal learning simulation, articulatory speech synthesis, automatic phoneme recognition
UCL classification:	UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Div of Psychology and Lang Sciences > Speech, Hearing and Phonetic Sciences
URI:	https://discovery-pp.ucl.ac.uk/id/eprint/10168267
