Audio Visual Speaker Localization Using Generative Models

Akash Kushal, Mandar Rahurkar


ABSTRACT

In this work we propose a generative-model-based approach to combining the audio and video modalities for person tracking. We demonstrate a principled and intuitive way of combining these modalities to obtain robustness against occlusion and changes in appearance. We further exploit the temporal correlations that exist between adjacent frames for a moving object to handle cases where even both modalities together are not enough, e.g., when the person is occluded and not speaking. We show the improvement in tracking results at each step and compare the results with manually annotated ground truth. The tracks obtained by our algorithm are comparable to the ground truth.
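To make the idea of fusing the two modalities with a temporal constraint concrete, here is a minimal sketch. It is not the model used in this work: it assumes a one-dimensional grid of candidate speaker positions, Gaussian-shaped audio and video likelihoods, and a Gaussian motion kernel, and every name in it (gaussian, predict, update, N) is hypothetical. It only illustrates why a temporal prior can carry the track through a frame in which the person is occluded and silent.

```python
import math

def gaussian(n, center, sigma):
    """Unnormalized Gaussian bump over grid cells 0..n-1 (assumed likelihood shape)."""
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]

def normalize(p):
    """Scale a list of non-negative weights so that it sums to one."""
    total = sum(p)
    return [v / total for v in p]

def predict(belief, motion_sigma=1.5):
    """Temporal constraint: spread the previous belief with a Gaussian motion kernel."""
    n = len(belief)
    out = [0.0] * n
    for j, b in enumerate(belief):
        for i, k in enumerate(gaussian(n, j, motion_sigma)):
            out[i] += b * k
    return normalize(out)

def update(belief, video_lik=None, audio_lik=None):
    """Multiply in whichever likelihoods are available.

    A missing modality (occlusion, silence) is treated as uninformative,
    so the predicted belief passes through unchanged for that cue.
    """
    post = list(belief)
    for lik in (video_lik, audio_lik):
        if lik is not None:
            post = [p * l for p, l in zip(post, lik)]
    return normalize(post)

def argmax(p):
    """Index of the most probable grid cell."""
    return max(range(len(p)), key=p.__getitem__)

N = 40                          # number of grid cells (assumed)
belief = normalize([1.0] * N)   # start with no knowledge of the position

# Frame 1: both cues available, roughly agreeing on a position near cell 10.
belief = update(predict(belief), gaussian(N, 10, 2.0), gaussian(N, 12, 4.0))
est_both = argmax(belief)

# Frame 2: person occluded and not speaking, so only the temporal prior remains.
belief = update(predict(belief))
est_none = argmax(belief)

# Frame 3: still occluded, but speaking again; the audio cue pulls the estimate.
belief = update(predict(belief), audio_lik=gaussian(N, 14, 4.0))
est_audio = argmax(belief)

print(est_both, est_none, est_audio)
```

In this sketch the estimate in the second frame stays near the first-frame estimate even though neither cue is observed, and the broad audio likelihood in the third frame shifts it only part of the way toward the audio peak because the prior still carries weight.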

 


Tracking Result Videos

Beal et al. Video Model on Toy Data
Modified Video Model on Real Data

Video demonstrating loss of track due to change in appearance
Adding temporal constraints helps tracker regain the track

Tracking with Video
Tracking with Audio + Video
Tracking with Audio + Video + Temporal Constraints