
Multimodal Speaker Diarization Utilizing Face Clustering Information

Abstract

Multimodal clustering/diarization tries to answer the question "who spoke when" by using audio and visual information. Diarization consists of two steps: first, segmentation of the audio information and detection of the speech segments, and then clustering of the speech segments to group the speakers. This task has mainly been studied on audiovisual data from meetings, news broadcasts, or talk shows. In this paper, we use visual information to aid speaker clustering. We tested the proposed method on three full-length movies, i.e., a scenario much more difficult than the ones used so far, where there is no certainty that speech segments and video appearances of actors will always overlap. The results showed that visual information can improve speaker clustering accuracy and hence the diarization process.
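
To make the two-step pipeline concrete, here is a minimal sketch, not the authors' method: it assumes speech segments have already been detected and embedded, fuses audio-segment similarity with face-cluster agreement into a single affinity, and clusters that affinity hierarchically. The function name, the fusion weight alpha, and the toy data are all hypothetical.

    # A minimal sketch of audiovisual speaker clustering (not the paper's code).
    # Inputs are assumed: one speaker embedding per detected speech segment, and
    # the id of the face cluster visible during each segment (-1 if no face
    # overlaps the segment, as can happen in movies).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def diarize(audio_emb, face_cluster, alpha=0.3, n_speakers=4):
        # Cosine similarity between audio segment embeddings.
        X = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
        audio_sim = X @ X.T

        # Reward pairs of segments whose overlapping faces fall in the
        # same face cluster; segments without a face (-1) get no bonus.
        same_face = (face_cluster[:, None] == face_cluster[None, :]) \
                    & (face_cluster[:, None] >= 0)
        fused = (1 - alpha) * audio_sim + alpha * same_face

        # Convert fused affinity to a distance and cluster hierarchically.
        dist = 1.0 - fused
        np.fill_diagonal(dist, 0.0)
        condensed = squareform((dist + dist.T) / 2, checks=False)
        Z = linkage(condensed, method='average')
        return fcluster(Z, t=n_speakers, criterion='maxclust')

    # Toy usage: 6 segments from 2 speakers, with partial face information.
    rng = np.random.default_rng(0)
    emb = np.vstack([rng.normal(0, 1, (3, 16)) + 5,
                     rng.normal(0, 1, (3, 16)) - 5])
    faces = np.array([0, 0, -1, 1, -1, 1])  # -1 = no overlapping face track
    print(diarize(emb, faces, n_speakers=2))  # e.g. [1 1 1 2 2 2]

The fusion weight trades off the two modalities: with alpha = 0 this reduces to audio-only clustering, while a larger alpha lets face-cluster agreement pull together segments whose audio embeddings are ambiguous.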

This paper was published in Explore Bristol Research.
