Data-centric learning for primate action recognition

Data-centric learning represents a paradigm shift in how we approach artificial intelligence and machine learning development. Traditional machine learning has predominantly focused on model architecture and algorithm optimization (the "model-centric" approach). Data-centric learning instead emphasizes the systematic improvement of data quality, composition, and labeling as the primary driver of model performance. This shift is particularly exciting because it opens novel research directions at the intersection of data quality assessment, automated data cleaning, and intelligent data synthesis.

In this project, we want to use data-centric methods to build better video-based models for studying how primates behave in the wild. Our group has collected many primate videos from different sources: research partners, YouTube, and published datasets. We expect that selecting training data with data-centric methods will lead to substantially better model performance than naively sampling videos from this collection.

Project Goals and Methodology

The main goal is to develop methods that can automatically find the most useful training videos from our large collection. We want to develop automated ways to check video quality and make sure we have diverse examples. The project has several main steps:

  1. Feature Extraction: Extracting useful information (features) from our videos using both classical approaches (e.g. histograms) and pretrained machine learning models like CLIP.
  2. Heuristic Approaches: Developing and applying heuristics for measuring video usefulness. This means looking at both how good the video quality is and how different it is from other videos we already have.
  3. Evaluation and Refinement: Evaluating and refining those heuristics to understand what useful videos look like.
  4. Applying Machine Learning: Based on insights gained from the previous step, designing and training a usefulness classifier to select good training data.
  5. Checking Performance: Benchmarking your approaches by training a primate action recognition model with a fixed architecture and fixed hyperparameters on the datasets you generate.
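
To give a rough idea of what steps 1 and 2 could look like in practice, here is a minimal sketch using only NumPy. It is an illustration under our own assumptions, not the project's actual pipeline: the function names (color_histogram, sharpness, select_diverse), the choice of a Laplacian-variance sharpness score as the quality heuristic, and the greedy farthest-point rule for diversity are all hypothetical choices made for this example. Features from a pretrained model such as CLIP could be substituted for the histogram features.

```python
import numpy as np


def color_histogram(frames, bins=8):
    """Classical per-video feature: per-channel intensity histograms.

    frames: uint8 array of shape (T, H, W, 3).
    Returns a vector of length 3 * bins that sums to 1.
    """
    feats = []
    for c in range(3):
        hist, _ = np.histogram(frames[..., c], bins=bins, range=(0, 256))
        feats.append(hist)
    feat = np.concatenate(feats).astype(np.float64)
    return feat / feat.sum()


def sharpness(frames):
    """Quality heuristic (assumed): mean variance of a 4-neighbour Laplacian.

    Blurry or uniform frames give values near zero.
    """
    gray = frames.astype(np.float64).mean(axis=-1)  # shape (T, H, W)
    lap = (gray[:, :-2, 1:-1] + gray[:, 2:, 1:-1]
           + gray[:, 1:-1, :-2] + gray[:, 1:-1, 2:]
           - 4.0 * gray[:, 1:-1, 1:-1])
    return float(lap.var(axis=(1, 2)).mean())


def select_diverse(features, quality, k, min_quality=0.0):
    """Greedy farthest-point selection among sufficiently good videos.

    Starts from the highest-quality video, then repeatedly adds the video
    whose features are farthest from everything selected so far.
    Returns a list of indices into `features`.
    """
    features = np.asarray(features, dtype=np.float64)
    quality = np.asarray(quality, dtype=np.float64)
    candidates = np.flatnonzero(quality >= min_quality)
    if candidates.size == 0 or k <= 0:
        return []
    pos = int(np.argmax(quality[candidates]))
    selected = [int(candidates[pos])]
    # Distance of every candidate to its nearest selected video.
    min_dist = np.linalg.norm(features[candidates] - features[candidates[pos]], axis=1)
    min_dist[pos] = -np.inf  # never pick the same video twice
    while len(selected) < min(k, candidates.size):
        pos = int(np.argmax(min_dist))
        selected.append(int(candidates[pos]))
        d = np.linalg.norm(features[candidates] - features[candidates[pos]], axis=1)
        min_dist = np.minimum(min_dist, d)
        min_dist[pos] = -np.inf
    return selected
```

In this sketch, each video would first be mapped to a feature vector and a quality score, and select_diverse would then pick a subset of k videos to train on. Evaluating whether such a subset actually beats random sampling is what steps 3 and 5 are about.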

Candidate Profile

This project is great for you if you want to dive into the more engineering-heavy aspects of modern machine learning, such as designing pipelines to process terabytes of video data. It is best suited for students with a computer science background and basic familiarity with machine learning, but we also welcome applications from other backgrounds. We will provide access to high-performance computing resources and the necessary technical support.

Contact

To apply, please email Felix Benjamin Müller, stating your interest in this project and detailing your relevant skills.

Neural Data Science Group
Institute of Computer Science
University of Goettingen