Self-Supervised Pretraining for Training Robust Monkey Detection and Tracking Models


Developing robust models that accurately detect and track monkeys across diverse environmental conditions remains a significant challenge because of occlusions and variations in pose and lighting. This thesis proposal explores the application of self-supervised pretraining techniques to improve the performance of monkey detection and tracking models.


The main objective of this research is to investigate the effectiveness of self-supervised pretraining methods in improving the robustness and generalization capabilities of monkey detection and tracking models. By leveraging large amounts of unlabeled visual data, the proposed approach aims to learn meaningful representations that capture relevant visual cues and can be transferred to downstream detection and tracking tasks.


  • Literature Review: Conduct a comprehensive review of monkey detection and tracking literature to identify gaps and challenges suitable for self-supervised pretraining. Additionally, develop an understanding of the state-of-the-art self-supervised methods for image and video data.

  • Image-based Self-Supervised Pretraining: Utilize relevant self-supervised methods, such as contrastive learning, to train robust image-based representations from unlabeled monkey images, incorporating image-level context.

  • Video-based Self-Supervised Pretraining: Utilize video-specific self-supervised techniques, such as VideoMAE [1], to learn spatiotemporal representations from unlabeled monkey videos, prioritizing the capture of temporal context for enhanced robustness and accuracy in monkey detection and tracking models.

  • Comparison: Compare image-based and video-based pretraining in the context of monkey detection and tracking.
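
To make the two pretraining ideas above more concrete, here is a minimal NumPy sketch, not the project's actual method. It shows (a) an InfoNCE-style contrastive loss over two augmented views of the same images and (b) the kind of tube masking used in masked video autoencoding such as VideoMAE [1], where the same spatial patches are hidden at every temporal position. The function names, the temperature of 0.1, and the mask ratio of 0.9 are illustrative assumptions; a real implementation would most likely use PyTorch and a trained encoder.

```python
import numpy as np


def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss between two embedding batches of shape (N, D).

    Row i of z_a and row i of z_b are assumed to be embeddings of two
    augmented views of the same image (a positive pair); every other row
    of z_b serves as a negative for z_a[i].
    """
    # L2-normalize so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy where the correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_prob)))


def tube_mask(num_steps, num_patches, mask_ratio=0.9, rng=None):
    """Boolean mask of shape (num_steps, num_patches); True means masked.

    One random set of spatial patches is drawn and repeated at every
    temporal position, so a masked patch cannot be recovered by simply
    copying it from a neighbouring frame.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_masked = int(round(mask_ratio * num_patches))
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial = np.zeros(num_patches, dtype=bool)
    spatial[masked_idx] = True
    return np.tile(spatial, (num_steps, 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = rng.normal(size=(8, 16))  # stand-in for encoder outputs
    print("loss, matching views:  ", info_nce_loss(views, views))
    print("loss, mismatched views:", info_nce_loss(views, views[::-1]))
    print("masked patches per step:", tube_mask(4, 10, rng=rng).sum(axis=1))
```

In the image-based setting, the two inputs to `info_nce_loss` would come from an encoder applied to two random augmentations of each unlabeled monkey image. In the video-based setting, the encoder would see only the unmasked patches and a decoder would be trained to reconstruct the masked ones.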

Expected Outcomes

  • Improved robustness and generalization of monkey detection and tracking models in image and video domains.

  • Identification of effective self-supervised pretraining techniques for image-based and video-based models.

  • Insights into the impact of architectural choices and data augmentation on model performance.


Requirements

  • Python programming
  • Prior knowledge of deep learning, in particular self-supervised learning paradigms, is a plus.
  • Data analysis experience is a plus.
  • Experience with PyTorch is a plus.


[1] Tong, Zhan, et al. “VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training.” arXiv preprint arXiv:2203.12602 (2022).


Sharmita Dey

Neural Data Science Group
Institute of Computer Science
University of Goettingen