In this work we propose a generative-model-based approach to combining audio and
video modalities for person tracking. We present a principled and intuitive way
of combining these modalities to gain robustness against occlusion and changes
in appearance. We further exploit the temporal correlation that exists between
adjacent frames for a moving object to handle cases where even both modalities
together are not enough, e.g., when the person is occluded and not speaking.
The improvement in tracking results is shown at each step and compared with
manually annotated ground truth. The tracks obtained by our algorithm are
comparable to the ground truth.
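The idea of fusing two modalities and adding a temporal constraint can be illustrated with a minimal sketch. This is not the generative model used in this work; it is a simplified stand-in that assumes (1) the per-position log-likelihoods from video and from audio have already been computed over a discrete set of horizontal positions, (2) the two modalities are conditionally independent given the position, and (3) the temporal correlation between adjacent frames can be modeled as a Gaussian random-walk prior centered on the previous estimate, with a hypothetical width `sigma`. The names `fuse_position_posterior` and `argmax` are illustrative only.

```python
import math
from typing import List, Optional, Sequence


def fuse_position_posterior(
    video_loglik: Sequence[float],
    audio_loglik: Sequence[float],
    prev_pos: Optional[int] = None,
    sigma: float = 3.0,
) -> List[float]:
    """Posterior over discrete horizontal positions x = 0..N-1 (illustrative sketch).

    log p(x | video, audio) = log p(video | x) + log p(audio | x) + log p(x | x_prev) + const,
    assuming (not taken from the source) that video and audio are conditionally
    independent given x and that the temporal prior is a Gaussian random walk.
    """
    if len(video_loglik) != len(audio_loglik):
        raise ValueError("video and audio log-likelihoods must cover the same positions")

    log_post = []
    for x, (lv, la) in enumerate(zip(video_loglik, audio_loglik)):
        lp = lv + la  # combine evidence from both modalities
        if prev_pos is not None:
            # Hypothetical temporal constraint: penalize large jumps from the previous position.
            lp -= 0.5 * ((x - prev_pos) / sigma) ** 2
        log_post.append(lp)

    # Subtract the maximum before exponentiating for numerical stability, then normalize.
    m = max(log_post)
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the most probable position)."""
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    n = 20
    # Visible and speaking: both modalities peak at position 5.
    video = [-float((x - 5) ** 2) for x in range(n)]
    audio = [-0.5 * (x - 5) ** 2 for x in range(n)]
    print(argmax(fuse_position_posterior(video, audio)))  # 5

    # Occluded and silent: both modalities are flat, so only the temporal prior
    # keeps the estimate at the previous position.
    flat = [0.0] * n
    print(argmax(fuse_position_posterior(flat, flat, prev_pos=12)))  # 12
```

The second case mirrors the situation described above: when neither modality carries information, the temporal term alone determines the estimate, so the track is held rather than lost.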
Tracking Result Videos
Beal et al. Video Model on Toy Data
Modified Video Model on Real Data
Video demonstrating loss of track due to a change in appearance
Adding temporal constraints helps the tracker regain the track
Tracking with Video
Tracking with Audio + Video
Tracking with Audio + Video + Temporal Constraints