The paradigm of pretraining a backbone on a large set of (often unlabeled) images has gained popularity. The quality of the resulting features is commonly measured by freezing the backbone and training different task heads on top of it. However, current evaluations cover only whole-image classification or require complex dense task heads, which introduce a large number of parameters and add their own inductive biases. In this work, we propose dense attentive probing, a parameter-efficient readout method for dense prediction on arbitrary backbones -- independent of the size and resolution of their feature volume. To this end, we extend cross-attention with distance-based masks of learnable sizes. We employ this method to evaluate 18 common backbones on dense prediction tasks in three dimensions: instance awareness, local semantics and spatial understanding. We find that DINOv2 outperforms all other backbones tested -- including those supervised with masks and language -- across all three task categories. Furthermore, our analysis suggests that self-supervised pretraining tends to yield features that separate object instances better than vision-language models.
We train Dense Attentive Probing as a readout on top of several frozen backbones and evaluate downstream performance on dense prediction tasks. Because the readout has only a small number of parameters, we can measure dense prediction performance directly, analogous to how linear probing is used for whole-image tasks (a minimal sketch of this probing protocol follows the table). These are our results:
Task groups: instance discrimination (Inst. Disc.): MSE (BD), IoU (BD), AP_lg, ARI; semantic segmentation (Semseg): CE (P), IoU (P), CE (CC), IoU (CC); depth estimation: RMSE.

| Backbone | MSE (BD) | IoU (BD) | AP_lg | ARI | CE (P) | IoU (P) | CE (CC) | IoU (CC) | RMSE |
|---|---|---|---|---|---|---|---|---|---|
| Random (untrained) | 0.2544 | 0.8 | 0.1 | 23.5 | 2.7364 | 3.5 | 1.5140 | 28.1 | 1.757 |
| ImageNet | 0.1787 | 16.0 | 11.3 | 36.3 | 0.3924 | 61.0 | 2.0630 | 12.6 | 0.713 |
| SAM V2 B+ | 0.1394 | 30.2 | 17.6 | 50.0 | 0.6005 | 33.9 | 1.4417 | 27.3 | 0.714 |
| MoCo V3 | 0.1613 | 21.4 | 20.8 | 41.8 | 0.3274 | 63.3 | 1.3423 | 30.2 | 0.606 |
| DINO | 0.1433 | 28.3 | 23.4 | 41.5 | 0.3397 | 63.8 | 1.0468 | 42.8 | 0.585 |
| DINOv2 | 0.1297 | 32.4 | 26.7 | 46.8 | 0.1259 | 83.4 | 1.0347 | 43.8 | 0.425 |
| DINOv2 (ViT-L) | 0.1450 | 33.8 | 28.1 | 46.4 | 0.1173 | 85.3 | 1.4822 | 25.3 | 0.398 |
| MAE | 0.1500 | 25.3 | 26.6 | 49.8 | 0.3471 | 60.2 | 1.5007 | 25.7 | 0.580 |
| Hiera B+ | 0.1717 | 17.1 | 23.1 | 44.8 | 0.3344 | 61.5 | 1.3367 | 32.0 | 0.523 |
| CLIP | 0.1665 | 19.0 | 13.9 | 39.5 | 0.2710 | 68.3 | 1.2623 | 35.5 | 0.603 |
| CLIP (ViT-L) | 0.1543 | 24.3 | 20.0 | 40.6 | 0.2228 | 74.9 | 1.3427 | 31.9 | 0.515 |
| MetaCLIP | 0.1669 | 19.1 | 16.7 | 38.4 | 0.2580 | 69.4 | 1.3355 | 33.0 | 0.576 |
| SigLIP-224 | 0.1718 | 17.2 | 16.8 | 37.8 | 0.2898 | 67.1 | 1.2565 | 35.4 | 0.598 |
| SigLIP-384 | 0.1564 | 23.3 | 18.4 | 39.1 | 0.2137 | 73.9 | 1.2435 | 36.3 | 0.568 |
| SigLIP-512 | 0.1496 | 26.2 | 19.1 | 38.9 | 0.2203 | 75.7 | 1.1737 | 38.8 | 0.565 |
| SigLIP-SO | 0.1496 | 25.7 | 19.5 | 40.7 | 0.1908 | 77.9 | 1.1957 | 37.5 | 0.488 |
| AiM2 | 0.1556 | 23.5 | 20.1 | 40.0 | 0.2244 | 75.2 | NaN | NaN | 0.498 |
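The probing protocol referenced above can be summarized in a few lines. The following is a minimal sketch, assuming a generic backbone that maps images to a grid of patch tokens of shape (B, N, C); `train_probe`, the `DenseReadout` stand-in (sketched further below), and the loss choice are illustrative and not the paper's exact training setup.

```python
# Minimal sketch of the probing protocol: freeze a pretrained backbone and
# train only a small dense readout on top of its features.
import torch
import torch.nn as nn

def train_probe(backbone: nn.Module, readout: nn.Module, loader, epochs: int = 10):
    backbone.eval()                          # frozen: backbone weights are never updated
    for p in backbone.parameters():
        p.requires_grad_(False)

    opt = torch.optim.AdamW(readout.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()          # e.g. semantic segmentation; task-dependent

    for _ in range(epochs):
        for images, targets in loader:       # targets assumed to match the readout resolution
            with torch.no_grad():            # features only, no gradients through the backbone
                feats = backbone(images)     # (B, N, C) patch tokens
            logits = readout(feats)          # (B, num_classes, H, W)
            loss = loss_fn(logits, targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return readout
```

Because only the readout is optimized, the downstream score mainly reflects how much task-relevant information is already present in the frozen features.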
The idea of Dense Attentive Probing is to use cross-attention as a minimal task head on top of frozen backbones; frozen means that the backbone's weights are not updated during training. Because the head has only a small number of parameters, we directly assess the quality of the backbone's features for the respective task.
Our method generates outputs for different tasks using a single cross-attention layer that attends to the feature volume, with each query responsible for generating a local region of the output. We enable the queries to focus on local regions of the feature volume (i.e., neighborhoods close to the query's location) by masking the attention with distance-based masks of learnable size.
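A minimal PyTorch sketch of such a readout is given below. It assumes the backbone returns a square grid of patch tokens of shape (B, N, C); the distance-based masks of learnable size are approximated here by a soft distance penalty with a learnable per-head radius, and each query emits a single output cell rather than a full local region. Class and parameter names (`DenseReadout`, `grid`, `log_radius`) are illustrative, not the released implementation.

```python
# Sketch of a distance-masked cross-attention readout over frozen backbone tokens.
import math
import torch
import torch.nn as nn

class DenseReadout(nn.Module):
    def __init__(self, feat_dim, out_dim, grid=32, heads=4, dim=256):
        super().__init__()
        self.grid, self.heads, self.dim = grid, heads, dim
        self.queries = nn.Parameter(torch.randn(grid * grid, dim) * 0.02)  # one query per output cell
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(feat_dim, 2 * dim)
        self.log_radius = nn.Parameter(torch.zeros(heads))                 # learnable mask size per head
        self.out = nn.Linear(dim, out_dim)

    @staticmethod
    def _coords(n, device):
        # normalized cell-center coordinates of an n x n grid, shape (n*n, 2)
        r = (torch.arange(n, device=device) + 0.5) / n
        y, x = torch.meshgrid(r, r, indexing="ij")
        return torch.stack([x, y], dim=-1).reshape(-1, 2)

    def forward(self, feats):                                   # feats: (B, N, C)
        B, N, _ = feats.shape
        h = int(math.sqrt(N))                                   # assume square token grid
        q_pos = self._coords(self.grid, feats.device)           # (Q, 2) query locations
        k_pos = self._coords(h, feats.device)                   # (N, 2) token locations
        dist = torch.cdist(q_pos, k_pos)                        # (Q, N) spatial distances

        q = self.to_q(self.queries).reshape(-1, self.heads, self.dim // self.heads)
        k, v = self.to_kv(feats).chunk(2, dim=-1)
        k = k.reshape(B, N, self.heads, -1)
        v = v.reshape(B, N, self.heads, -1)

        attn = torch.einsum("qhd,bnhd->bhqn", q, k) / math.sqrt(q.shape[-1])
        radius = self.log_radius.exp().view(1, self.heads, 1, 1)
        attn = attn - (dist.view(1, 1, *dist.shape) / radius) ** 2   # soft distance mask
        attn = attn.softmax(dim=-1)

        out = torch.einsum("bhqn,bnhd->bqhd", attn, v).reshape(B, -1, self.dim)
        out = self.out(out)                                     # (B, Q, out_dim)
        return out.transpose(1, 2).reshape(B, -1, self.grid, self.grid)
```

For example, with `feat_dim=768` (ViT-B tokens) and `out_dim=21` classes, an input of shape `(2, 196, 768)` yields logits of shape `(2, 21, 32, 32)`. A small radius confines each query to its neighborhood in the feature volume; a large radius lets it attend globally.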
@article{lueddecke2025deap,
title={Characterizing Vision Backbones for Dense Prediction with Dense Attentive Probing},
author={Timo Lüddecke and Alexander Ecker},
journal={Transactions on Machine Learning Research},
year={2025},
url={https://openreview.net/forum?id=neMAx4uBlh},
note={accepted}
}
This publication was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project-ID 454648639 - SFB 1528