Using Video Backbone Models for Monkey Detection and Tracking Models

Motivation

Developing robust models that accurately detect and track monkeys under diverse environmental conditions remains a significant challenge due to variations in pose, lighting, and occlusion. In recent years, however, self-supervised learning has produced video backbone models pretrained on large-scale unlabeled video data, such as V-JEPA [1] and Hiera [2]. These models yield high-level video representations that can be reused for a variety of downstream tasks.

Thesis

Our goal is to explore how well pretrained video backbones can be employed for detection and tracking, specifically for challenging in-the-wild animal settings.

The steps for this thesis are:

  • Review existing Transformer-based tracking methods, self-supervised training methods, and monkey tracking datasets and baseline methods
  • Design a tracking model architecture based on a video backbone (e.g. V-JEPA)
  • Train and evaluate the tracking model (finetuning the video backbone if needed)
  • Explore the model's robustness and generalizability by comparing performance across different datasets
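To make the second and third steps concrete, here is a minimal, hypothetical PyTorch sketch of one possible design: a frozen, pretrained video backbone producing a token sequence, with a lightweight query-based head predicting per-animal bounding boxes on top. The backbone here is a stand-in stub (`StubVideoBackbone`); in the actual project it would be replaced by real V-JEPA or Hiera features, and the head design is only one of several options to explore.

```python
# Hypothetical sketch: a lightweight tracking head on top of a frozen video
# backbone. StubVideoBackbone is a placeholder for a real pretrained model.
import torch
import torch.nn as nn

class StubVideoBackbone(nn.Module):
    """Stand-in for a pretrained video backbone (e.g. V-JEPA); kept frozen."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # 3D conv patchifier as a stand-in for the real token encoder
        self.patchify = nn.Conv3d(3, embed_dim, kernel_size=(2, 16, 16),
                                  stride=(2, 16, 16))

    def forward(self, video):          # video: (B, 3, T, H, W)
        tokens = self.patchify(video)  # (B, D, T/2, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

class BoxHead(nn.Module):
    """Predicts one bounding box + confidence per query (per tracked animal)."""
    def __init__(self, embed_dim=256, num_queries=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4,
                                          batch_first=True)
        self.box = nn.Linear(embed_dim, 4)    # (cx, cy, w, h) in [0, 1]
        self.score = nn.Linear(embed_dim, 1)  # confidence logit

    def forward(self, tokens):                # tokens: (B, N, D)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)  # queries attend to video tokens
        return self.box(out).sigmoid(), self.score(out)

backbone = StubVideoBackbone().eval()
for p in backbone.parameters():       # freeze backbone; train only the head
    p.requires_grad = False
head = BoxHead()

video = torch.randn(1, 3, 8, 64, 64)  # (batch, channels, frames, H, W)
with torch.no_grad():
    boxes, scores = head(backbone(video))
print(boxes.shape, scores.shape)      # (1, 4, 4) boxes, (1, 4, 1) scores
```

Freezing the backbone keeps training cheap and isolates how much the pretrained representation alone contributes; unfreezing it later corresponds to the finetuning step above.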

Contact

To apply, please email Felix Benjamin Müller stating your interest in this project and detailing your relevant skills. Please note that this project is best suited for a student who already has basic familiarity with deep learning and PyTorch.

References

[1] Bardes, Adrien, et al. “Revisiting feature prediction for learning visual representations from video.” arXiv preprint arXiv:2404.08471 (2024).

[2] Ryali, Chaitanya, et al. “Hiera: A hierarchical vision transformer without the bells-and-whistles.” International Conference on Machine Learning. PMLR, 2023.

Neural Data Science Group
Institute of Computer Science
University of Goettingen