DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

Qi, Zipeng; Zhang, Xulong; Cheng, Ning; Xiao, Jing; Wang, Jianzong

arXiv:2309.07509 (cs)

[Submitted on 14 Sep 2023]

Abstract: Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges of applying diffusion models, which are traditionally trained on text-image pairs, directly to audio control. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in connecting the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This approach efficiently produces articulate talking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, without the need for additional alignment between audio and image features.
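
The two-stage data flow described in the abstract (audio to landmarks, then landmarks to image) can be illustrated with a minimal sketch. Everything below is a hypothetical placeholder rather than the authors' implementation: the function names, the assumption that the completion stage receives partial landmarks plus per-frame audio features, and the toy arithmetic standing in for the transformer and the diffusion model are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x, y) points in normalized image coordinates; a hypothetical representation.
Landmarks = List[Tuple[float, float]]
Image = List[List[float]]


@dataclass
class Frame:
    landmarks: Landmarks
    pixels: Image


def complete_landmarks(audio_feature: List[float], partial: Landmarks,
                       n_mouth: int = 4) -> Landmarks:
    """Stage 1 placeholder for the transformer-based landmarks completion network.

    Assumption: the network fills in missing (here, mouth) landmarks from audio.
    This toy version just scales mouth opening with mean absolute audio energy.
    """
    energy = sum(abs(a) for a in audio_feature) / max(len(audio_feature), 1)
    opening = min(energy, 1.0) * 0.1
    mouth = [
        (0.4 + 0.2 * i / (n_mouth - 1), 0.75 + (opening if i % 2 else -opening))
        for i in range(n_mouth)
    ]
    return list(partial) + mouth


def generate_face(landmarks: Landmarks, identity_image: Image) -> Frame:
    """Stage 2 placeholder for the diffusion-based face generation network.

    This toy version copies the identity image and marks each landmark pixel,
    so the output depends only on landmarks and image, never on raw audio.
    """
    size = len(identity_image)
    pixels = [row[:] for row in identity_image]  # copy; do not mutate the input
    for x, y in landmarks:
        row = min(int(y * size), size - 1)
        col = min(int(x * size), size - 1)
        pixels[row][col] = 1.0
    return Frame(landmarks=landmarks, pixels=pixels)


def talk(audio_features: List[List[float]], partial: Landmarks,
         identity_image: Image) -> List[Frame]:
    """Chain the two stages once per audio frame."""
    return [
        generate_face(complete_landmarks(a, partial), identity_image)
        for a in audio_features
    ]
```

In this sketch the second stage never sees audio directly, which mirrors the abstract's point that landmarks serve as the bridge between the audio and image domains.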

Comments:	Submitted to ICASSP 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2309.07509 [cs.CV]
	(or arXiv:2309.07509v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.07509

Submission history

From: Zipeng Qi
[v1] Thu, 14 Sep 2023 08:22:34 UTC (618 KB)