Background-free latents for improved generalization

Source: https://openai.com/index/clip/
Motivation
Background bias, where a model learns spurious correlations with the image background instead of the intended classification task, is an important and very common issue in computer vision. In recent years, it has become increasingly common to feed latent representations from large pretrained models such as CLIP/SigLIP2 or DINOv2, rather than raw pixels, into models that solve various computer vision tasks. This project asks the question: Can we reduce background bias and improve generalization by removing background information from latent representations before the actual model sees them?
Project
We are exploring ways to remove background information from CLIP (and possibly DINOv2) latent representations. To do this, we try to exploit textual descriptions of the background and foreground (CLIP image embeddings are aligned with those of a text encoder) and/or paired background-only images. We will likely work with the PanAf-FGBG dataset and similar background-bias datasets.
This project will involve: extracting latent representations for datasets using large pretrained models; manipulating those latent representations using PyTorch or NumPy; and training Transformer-based classification models that take the modified latent representations as input.
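To make the manipulation step more concrete, here is a minimal NumPy sketch of one possible approach: projecting image latents onto the orthogonal complement of a set of "background" directions. This is only an illustration under stated assumptions, not the method the project will necessarily use. Random arrays stand in for real CLIP embeddings, and the names `remove_background_subspace`, `image_latents`, and `background_latents` are hypothetical. In practice the background directions could come from text embeddings of background descriptions or from embeddings of paired background-only images.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def remove_background_subspace(latents, background_dirs):
    """Remove the span of background_dirs from each latent vector.

    latents:         array of shape (n, d), one latent per image
    background_dirs: array of shape (k, d), assumed linearly independent
                     with k < d (an assumption of this sketch)
    """
    # Reduced QR of the (d, k) matrix yields an orthonormal basis of the span.
    basis, _ = np.linalg.qr(background_dirs.T)
    # Subtract each latent's component inside the background subspace.
    projected = latents - (latents @ basis) @ basis.T
    # Re-normalize, since CLIP-style latents are usually compared by cosine similarity.
    return l2_normalize(projected)


# Placeholder data: random vectors instead of real embeddings (assumption).
rng = np.random.default_rng(0)
dim = 512
image_latents = l2_normalize(rng.normal(size=(8, dim)))
background_latents = l2_normalize(rng.normal(size=(3, dim)))

cleaned = remove_background_subspace(image_latents, background_latents)

# Largest remaining cosine similarity to any background direction.
residual = np.abs(cleaned @ background_latents.T).max()
```

A linear projection like this removes only information that lies along the chosen directions. Whether that is enough to reduce background bias on real data is exactly the kind of question the project would investigate.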
Prerequisites
The following skills are required:
- Familiarity with Python (experience with PyTorch, NumPy, or SciPy would be helpful)
- Basic machine learning and deep learning knowledge
Contact
To apply, please email Felix Müller stating your interest in this project and detailing your relevant skills. Part of this project could also be carried out as a lab rotation.
