Characterizing Vision Backbones for Dense Prediction with Dense Attentive Probing

Timo Lüddecke and Alexander Ecker
University of Göttingen

Abstract

The paradigm of pretraining a backbone on a large set of (often unlabeled) images has gained popularity. The quality of the resulting features is commonly measured by freezing the backbone and training different task heads on top of it. However, current evaluations cover only classification of whole images or require complex dense task heads that introduce a large number of parameters and add their own inductive biases. In this work, we propose dense attentive probing, a parameter-efficient readout method for dense prediction on arbitrary backbones -- independent of the size and resolution of their feature volume. To this end, we extend cross-attention with distance-based masks of learnable sizes. We employ this method to evaluate 18 common backbones on dense prediction tasks in three dimensions: instance awareness, local semantics, and spatial understanding. We find that DINOv2 outperforms all other backbones tested -- including those supervised with masks and language -- across all three task categories. Furthermore, our analysis suggests that self-supervised pretraining tends to yield features that separate object instances better than vision-language pretraining does.

Results

We train Dense Attentive Probing as a readout on top of several frozen backbones and evaluate downstream performance on dense prediction tasks. Because the readout has only a small number of parameters, we can measure dense prediction performance directly, analogous to linear probing for tasks that involve the whole image. These are our results:

| Backbone | MSE (BD) | IoU (BD) | AP_lg | ARI (Inst. Disc.) | Semseg CE (P) | Semseg IoU (P) | Semseg CE (CC) | Semseg IoU (CC) | Depth RMSE |
|---|---|---|---|---|---|---|---|---|---|
| Random (untrained) | 0.2544 | 0.8 | 0.1 | 23.5 | 2.7364 | 3.5 | 1.5140 | 28.1 | 1.757 |
| ImageNet | 0.1787 | 16.0 | 11.3 | 36.3 | 0.3924 | 61.0 | 2.0630 | 12.6 | 0.713 |
| SAM V2 B+ | 0.1394 | 30.2 | 17.6 | 50.0 | 0.6005 | 33.9 | 1.4417 | 27.3 | 0.714 |
| MoCo V3 | 0.1613 | 21.4 | 20.8 | 41.8 | 0.3274 | 63.3 | 1.3423 | 30.2 | 0.606 |
| DINO | 0.1433 | 28.3 | 23.4 | 41.5 | 0.3397 | 63.8 | 1.0468 | 42.8 | 0.585 |
| DINOv2 | 0.1297 | 32.4 | 26.7 | 46.8 | 0.1259 | 83.4 | 1.0347 | 43.8 | 0.425 |
| DINOv2 (ViT-L) | 0.1450 | 33.8 | 28.1 | 46.4 | 0.1173 | 85.3 | 1.4822 | 25.3 | 0.398 |
| MAE | 0.1500 | 25.3 | 26.6 | 49.8 | 0.3471 | 60.2 | 1.5007 | 25.7 | 0.580 |
| Hiera B+ | 0.1717 | 17.1 | 23.1 | 44.8 | 0.3344 | 61.5 | 1.3367 | 32.0 | 0.523 |
| CLIP | 0.1665 | 19.0 | 13.9 | 39.5 | 0.2710 | 68.3 | 1.2623 | 35.5 | 0.603 |
| CLIP (ViT-L) | 0.1543 | 24.3 | 20.0 | 40.6 | 0.2228 | 74.9 | 1.3427 | 31.9 | 0.515 |
| MetaCLIP | 0.1669 | 19.1 | 16.7 | 38.4 | 0.2580 | 69.4 | 1.3355 | 33.0 | 0.576 |
| SigLIP-224 | 0.1718 | 17.2 | 16.8 | 37.8 | 0.2898 | 67.1 | 1.2565 | 35.4 | 0.598 |
| SigLIP-384 | 0.1564 | 23.3 | 18.4 | 39.1 | 0.2137 | 73.9 | 1.2435 | 36.3 | 0.568 |
| SigLIP-512 | 0.1496 | 26.2 | 19.1 | 38.9 | 0.2203 | 75.7 | 1.1737 | 38.8 | 0.565 |
| SigLIP-SO | 0.1496 | 25.7 | 19.5 | 40.7 | 0.1908 | 77.9 | 1.1957 | 37.5 | 0.488 |
| AiM2 | 0.1556 | 23.5 | 20.1 | 40.0 | 0.2244 | 75.2 | NaN | NaN | 0.498 |

Last updated: 2025-09-11

Self-supervised Learning yields better instance awareness

Here we focus on the instance-awareness tasks, i.e., we assess how well individual instances are encoded in the backbone features. Specifically, we consider object detection, instance boundary prediction, and instance discrimination.
Performance on instance-awareness tasks. Self-supervised methods outperform vision-language methods.
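The ARI column in the table above presumably refers to the adjusted Rand index, which scores how well a predicted grouping of pixels into instances matches the ground-truth grouping. A minimal sketch of the metric using scikit-learn (the labels below are toy values, not data from our experiments):

```python
# Toy sketch of the adjusted Rand index (ARI): it compares a predicted
# grouping of pixels into instances with the ground-truth grouping.
from sklearn.metrics import adjusted_rand_score

gt_instances   = [0, 0, 1, 1, 2, 2, 2]   # ground-truth instance id per pixel (toy values)
pred_instances = [1, 1, 0, 0, 2, 2, 0]   # predicted instance id per pixel (toy values)
print(adjusted_rand_score(gt_instances, pred_instances))   # 1.0 would be a perfect match
```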

Backbone property correlations

Using the data from above, we analyze to what degree performance is related to general properties of the backbone (image size, feature dimension, number of parameters), independent of the training method.
Correlation of task performance with backbone properties (p-values in brackets).
Semantic segmentation and depth estimation improve with larger backbone size (in terms of parameters) and higher input resolution, showing significant positive correlations. Semantic segmentation also correlates slightly with spatial feature size. Object detection shows weaker correlations with all three backbone variables, suggesting less sensitivity to these factors.
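As an illustration, such correlations with p-values can be computed with a rank correlation test; the sketch below uses SciPy's Spearman correlation on placeholder values (the correlation measure used in the paper and the actual numbers are not reproduced here):

```python
# Hypothetical sketch: correlate task performance with backbone properties.
# All numbers below are placeholders, not values from the table above.
import numpy as np
from scipy.stats import spearmanr

# Backbone properties (e.g. parameter count in millions, input resolution in pixels).
params = np.array([86, 304, 86, 632, 86])
resolution = np.array([224, 224, 518, 384, 512])
# Downstream scores (e.g. semantic segmentation IoU).
iou = np.array([61.0, 74.9, 83.4, 73.9, 75.7])

for name, prop in [("params", params), ("resolution", resolution)]:
    rho, p = spearmanr(prop, iou)          # rank correlation and its p-value
    print(f"{name}: rho={rho:.2f} (p={p:.3f})")
```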

Speed-accuracy trade-off

In practice, inference speed and memory requirements often matter. If these are constrained, a trade-off must be found. The following figure helps identify the right backbone depending on the task and the available inference compute.
Speed-accuracy trade-off in object detection, depth estimation and semantic segmentation.
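As a rough guide, backbone latency can be measured with a simple timing loop; the sketch below uses an arbitrary torchvision model and CPU timing as stand-ins, not the measurement setup behind the figure:

```python
# Minimal latency benchmark for a frozen backbone (any torchvision/timm model
# would do); batch size, input size, and iteration counts are assumptions.
import time
import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(3):                       # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    latency = (time.perf_counter() - start) / 10

print(f"mean latency: {latency * 1000:.1f} ms per image")
```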

Method: Dense Attentive Probing

The idea of Dense Attentive Probing is to use cross-attention as a minimal task head on top of frozen backbones. Frozen means that the backbone's weights are not updated during training. Because the head has only a small number of parameters, we directly assess the quality of the backbone's features for the respective task.

General architecture: We feed the features from frozen backbones into the DEAP adapter, which consists of two cross-attention layers followed by deconvolutions.
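To make this concrete, here is a rough PyTorch sketch of such an adapter: learned queries cross-attend twice to the (projected) backbone tokens, and the resulting query grid is upsampled by deconvolutions. Layer widths, the query grid size, and other details are assumptions for illustration, not the exact configuration from the paper.

```python
# Rough sketch of a DEAP-style adapter (illustrative, not the exact architecture):
# query tokens cross-attend to frozen backbone features, then deconvolutions
# upsample the query grid into a dense prediction.
import torch
import torch.nn as nn

class DenseProbe(nn.Module):
    def __init__(self, feat_dim=768, dim=256, query_grid=16, out_channels=1):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(query_grid * query_grid, dim))
        self.proj = nn.Linear(feat_dim, dim)          # project backbone features
        self.attn1 = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.grid = query_grid
        self.deconv = nn.Sequential(                  # upsample 16x16 -> 64x64
            nn.ConvTranspose2d(dim, dim // 2, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, feats):                         # feats: (B, N, feat_dim) backbone tokens
        b = feats.size(0)
        kv = self.proj(feats)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q, _ = self.attn1(q, kv, kv)
        q, _ = self.attn2(q, kv, kv)
        q = q.transpose(1, 2).reshape(b, -1, self.grid, self.grid)   # (B, dim, 16, 16)
        return self.deconv(q)

probe = DenseProbe()
out = probe(torch.randn(2, 196, 768))                 # e.g. ViT-B/16 tokens at 224px
print(out.shape)                                      # torch.Size([2, 1, 64, 64])
```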

Our method generates outputs for different tasks using cross-attention layers that attend to the feature volume and whose queries are each responsible for generating a local region of the output. We enable the queries to focus on local regions of the feature volume (i.e., neighborhoods close to the query's location) by masking the attention.

The attention mask depends on the query location and a learnable parameter σ.
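Below is a hypothetical sketch of how such a distance-based mask could be implemented: queries and backbone features are placed on normalized 2D grids, and the attention logits receive an additive penalty that grows with the squared distance between query and feature location, with the width of the penalty controlled by a learnable σ. The exact functional form used in the paper may differ.

```python
# Sketch of a distance-based attention mask with a learnable size parameter.
# Queries and keys live on 2D grids; the mask penalizes attention to feature
# locations far from the query's location, with the radius set by sigma.
import torch
import torch.nn as nn

def grid_coords(h, w):
    """Normalized (x, y) coordinates of an h x w grid, shape (h*w, 2)."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    return torch.stack([xs, ys], dim=-1).reshape(-1, 2)

class DistanceMask(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.tensor(0.0))   # sigma = exp(log_sigma) > 0

    def forward(self, query_hw, key_hw):
        q_pos = grid_coords(*query_hw)                      # (Nq, 2)
        k_pos = grid_coords(*key_hw)                        # (Nk, 2)
        dist2 = ((q_pos[:, None] - k_pos[None]) ** 2).sum(-1)   # squared distances, (Nq, Nk)
        sigma = self.log_sigma.exp()
        return -dist2 / (2 * sigma ** 2)                    # additive bias for attention logits

mask = DistanceMask()
bias = mask((64, 64), (14, 14))                             # queries on 64x64, features on 14x14
print(bias.shape)                                           # torch.Size([4096, 196])
# `bias` can be passed as `attn_mask` to nn.MultiheadAttention
# (float masks are added to the attention logits before the softmax).
```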

Dynamic Output Size

It is possible to generate dense predictions at resolutions different from the training resolution. In some cases, more details become visible this way.
The output size can be changed after training.
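One way to realize this (a hypothetical sketch, not necessarily the paper's exact mechanism) is to derive the query tokens from output-pixel coordinates rather than from a fixed-size learned grid; any output resolution can then be requested at inference time simply by sampling a denser coordinate grid:

```python
# Hypothetical sketch of resolution-independent queries: query tokens are
# computed from output-pixel coordinates, so the output grid size can be
# chosen freely after training.
import torch
import torch.nn as nn

class CoordQueries(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, h, w):
        ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (h*w, 2) pixel coordinates
        return self.mlp(coords)                                 # (h*w, dim) query tokens

queries = CoordQueries()
print(queries(64, 64).shape, queries(128, 128).shape)           # training vs. denser test grid
```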

Method Comparison

How about other methods for dense probing? We compare to three baselines: bicubic upscaling, a CNN readout, and a FeatUp-based upscaler.
Comparison of our method with a CNN baseline.
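For reference, the bicubic baseline is the simplest of the three: upsample the frozen backbone's feature map and apply a pointwise linear readout. A minimal sketch (channel counts, class count, and output size are illustrative assumptions):

```python
# Sketch of a bicubic-upscaling baseline: bicubically upsample the frozen
# backbone's feature map, then apply a 1x1 convolution as a linear readout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BicubicReadout(nn.Module):
    def __init__(self, feat_dim=768, num_classes=21, out_size=224):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.out_size = out_size

    def forward(self, feats):                      # feats: (B, C, H, W) from the backbone
        feats = F.interpolate(feats, size=(self.out_size, self.out_size),
                              mode="bicubic", align_corners=False)
        return self.head(feats)

readout = BicubicReadout()
print(readout(torch.randn(1, 768, 14, 14)).shape)  # torch.Size([1, 21, 224, 224])
```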
For more information, see our Paper and the corresponding source code.

Citation

@article{lueddecke2025deap,
  title={Characterizing Vision Backbones for Dense Prediction with Dense Attentive Probing},
  author={Timo Lüddecke and Alexander Ecker},
  journal={Transactions on Machine Learning Research},
  year={2025},
  url={https://openreview.net/forum?id=neMAx4uBlh},
  note={accepted}
}

Acknowledgements

This publication was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project-ID 454648639 - SFB 1528