Self-supervised learning on video

Many computer vision tasks rely on pre-trained feature extractors, typically implemented as convolutional neural networks. Contrastive methods have recently emerged as a promising approach to unsupervised feature learning [1,2]. In contrast to generative approaches, they do not require reconstructing the input from a latent representation.
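
To make the contrastive idea concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is a simplified, one-directional variant for illustration, not the exact objective of [1] or [2]. The function names, the temperature value and the toy embeddings are assumptions made for this example. Each anchor embedding is pulled towards its own positive, while the positives of the other samples in the batch act as negatives. No decoder or reconstruction step is involved.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a batch of embedding pairs.

    positives[i] is the positive for anchors[i]; every other entry of
    `positives` serves as a negative for that anchor.
    The default temperature is an illustrative assumption.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine_similarity(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        # Cross-entropy with the matching positive as the target class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings (hypothetical values, not from any dataset).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives close to their anchors
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives close to the wrong anchors

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned < loss_swapped)
```

In a video setting, one plausible choice (an assumption here, not a claim about the cited methods) would be to form positive pairs from clips or frames of the same video and to treat clips from other videos as negatives.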

A computationally challenging, yet fairly under-explored setting is learning video features. While [3] uses video data to learn image representations, [4] is a first example of an approach that uses video features. Training could be conducted on the Kinetics or HowTo100M [5] datasets.


[1] T. Chen et al.: A simple framework for contrastive learning of visual representations
[2] X. Chen et al.: Improved baselines with momentum contrastive learning
[3] Gordon et al.: Watching the World Go By: Representation Learning from Unlabeled Videos
[4] Knights et al.: Temporally Coherent Embeddings for Self-Supervised Video Representation Learning
[5] Miech et al.: End-to-End Learning of Visual Representations from Uncurated Instructional Videos


  • Good mathematical understanding (in particular statistics and linear algebra)
  • Python programming
  • Experience in machine learning


Timo Lüddecke

Neural Data Science Group
Institute of Computer Science
University of Goettingen