Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

A hybrid phoneme based clustering approach for audio driven facial animation

Havell, Benjamin, Rosin, Paul L. ORCID: https://orcid.org/0000-0002-4965-3884, Sanei, Saeid, Aubrey, Andrew, Marshall, Andrew David ORCID: https://orcid.org/0000-0003-2789-1395 and Hicks, Yulia Alexandrovna ORCID: https://orcid.org/0000-0002-7179-4587 2012. A hybrid phoneme based clustering approach for audio driven facial animation. Presented at: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25-30 March 2012. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing: Proceedings. Los Alamitos, CA: IEEE, pp. 2261-2264. DOI: 10.1109/ICASSP.2012.6288364

Full text not available from this repository.

Abstract

We consider the problem of producing accurate facial animation corresponding to a given input speech signal. A popular technique for audio driven facial animation is to build a joint audio-visual model, using Active Appearance Models (AAMs) to represent possible facial variations and Hidden Markov Models (HMMs) to select the correct appearance based on the input audio. However, several questions remain unanswered; in particular, the choice of clustering technique and of the number of clusters in the HMM can significantly influence the quality of the produced videos. We investigated a range of clustering techniques in order to improve the quality of the resulting HMM, and propose a new structure that uses Gaussian Mixture Models (GMMs) to model each phoneme separately. Comparing our approach to several alternatives on a public dataset of 300 phonetically labeled sentences spoken by a single speaker, we found that it produces more accurate animation. In addition, our approach is hybrid: the training data is phonetically labeled, yielding a model with better separation between phonemes, while the test audio is unlabeled, making the generation of facial animation fully automatic and less laborious.
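To make the per-phoneme GMM idea concrete, the sketch below fits one Gaussian mixture per phoneme on labeled training frames and then assigns unlabeled test frames to the phoneme whose model scores them highest. This is only a minimal illustration of the general technique using scikit-learn: the feature representation (e.g. MFCC frames), the component count, the diagonal covariances, and all function names are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_phoneme_gmms(features, phoneme_labels, n_components=4):
    """Fit one GMM per phoneme over its acoustic feature frames.

    features       : (n_frames, n_dims) array, e.g. MFCC frames (assumed)
    phoneme_labels : length-n_frames sequence of phoneme symbols
    """
    labels = np.asarray(phoneme_labels)
    models = {}
    for ph in set(phoneme_labels):
        frames = features[labels == ph]          # training frames for this phoneme
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              random_state=0)
        models[ph] = gmm.fit(frames)
    return models

def classify_frame(models, frame):
    """Map an unlabeled test frame to the phoneme whose GMM gives the
    highest log-likelihood, so no phonetic labels are needed at test time."""
    return max(models, key=lambda ph: models[ph].score(frame.reshape(1, -1)))
```

Fitting the mixtures on labeled data while scoring unlabeled test frames mirrors the hybrid setup described in the abstract: the labels shape the per-phoneme clusters during training, but the animation stage remains fully automatic.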

Item Type: Conference or Workshop Item (Paper)
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics; Engineering
Subjects: T Technology > TK Electrical engineering. Electronics. Nuclear engineering
Publisher: IEEE
ISBN: 9781467300452
Last Modified: 08 Feb 2023 07:29
URI: https://orca.cardiff.ac.uk/id/eprint/38779
