Characterizing Vision Backbones for Dense Prediction with Dense Attentive Probing

Timo Lüddecke and Alexander Ecker
University of Göttingen

Abstract

The paradigm of pretraining a backbone on a large set of (often unlabeled) images has gained popularity. The quality of the resulting features is commonly measured by freezing the backbone and training different task heads on top of it. However, current evaluations cover only classification of whole images or require complex dense task heads that introduce a large number of parameters and add their own inductive biases. In this work, we propose dense attentive probing, a parameter-efficient readout method for dense prediction on arbitrary backbones -- independent of the size and resolution of their feature volume. To this end, we extend cross-attention with distance-based masks of learnable sizes. We employ this method to evaluate 18 common backbones on dense prediction tasks in three dimensions: instance awareness, local semantics, and spatial understanding. We find that DINOv2 outperforms all other backbones tested -- including those supervised with masks and language -- across all three task categories. Furthermore, our analysis suggests that self-supervised pretraining tends to yield features that separate object instances better than vision-language pretraining does.

Results

We train Dense Attentive Probing as a readout on top of several frozen backbones and evaluate downstream performance on dense prediction tasks. Because the readout has only a small number of parameters, we can measure dense prediction performance directly, analogous to linear probing for tasks that involve the whole image. These are our results:

| Backbone | MSE (BD) | IoU (BD) | AP_lg | ARI (Inst. Disc.) | Semseg CE (P) | Semseg IoU (P) | Semseg CE (CC) | Semseg IoU (CC) | Depth RMSE |
|---|---|---|---|---|---|---|---|---|---|
| Random (untrained) | 0.2544 | 0.8 | 0.1 | 23.5 | 2.7364 | 3.5 | 1.5140 | 28.1 | 1.757 |
| ImageNet | 0.1787 | 16.0 | 11.3 | 36.3 | 0.3924 | 61.0 | 2.0630 | 12.6 | 0.713 |
| SAM V2 B+ | 0.1394 | 30.2 | 17.6 | 50.0 | 0.6005 | 33.9 | 1.4417 | 27.3 | 0.714 |
| MoCo V3 | 0.1613 | 21.4 | 20.8 | 41.8 | 0.3274 | 63.3 | 1.3423 | 30.2 | 0.606 |
| DINO | 0.1433 | 28.3 | 23.4 | 41.5 | 0.3397 | 63.8 | 1.0468 | 42.8 | 0.585 |
| DINOv2 | 0.1297 | 32.4 | 26.7 | 46.8 | 0.1259 | 83.4 | 1.0347 | 43.8 | 0.425 |
| DINOv2 (ViT-L) | 0.1450 | 33.8 | 28.1 | 46.4 | 0.1173 | 85.3 | 1.4822 | 25.3 | 0.398 |
| MAE | 0.1500 | 25.3 | 26.6 | 49.8 | 0.3471 | 60.2 | 1.5007 | 25.7 | 0.580 |
| Hiera B+ | 0.1717 | 17.1 | 23.1 | 44.8 | 0.3344 | 61.5 | 1.3367 | 32.0 | 0.523 |
| CLIP | 0.1665 | 19.0 | 13.9 | 39.5 | 0.2710 | 68.3 | 1.2623 | 35.5 | 0.603 |
| CLIP (ViT-L) | 0.1543 | 24.3 | 20.0 | 40.6 | 0.2228 | 74.9 | 1.3427 | 31.9 | 0.515 |
| MetaCLIP | 0.1669 | 19.1 | 16.7 | 38.4 | 0.2580 | 69.4 | 1.3355 | 33.0 | 0.576 |
| SigLIP-224 | 0.1718 | 17.2 | 16.8 | 37.8 | 0.2898 | 67.1 | 1.2565 | 35.4 | 0.598 |
| SigLIP-384 | 0.1564 | 23.3 | 18.4 | 39.1 | 0.2137 | 73.9 | 1.2435 | 36.3 | 0.568 |
| SigLIP-512 | 0.1496 | 26.2 | 19.1 | 38.9 | 0.2203 | 75.7 | 1.1737 | 38.8 | 0.565 |
| SigLIP-SO | 0.1496 | 25.7 | 19.5 | 40.7 | 0.1908 | 77.9 | 1.1957 | 37.5 | 0.488 |
| AiM2 | 0.1556 | 23.5 | 20.1 | 40.0 | 0.2244 | 75.2 | NaN | NaN | 0.498 |

Last updated: 2025-09-11

Self-supervised Learning yields better instance awareness

Here we focus on the instance-awareness tasks, i.e., we assess how well individual instances are encoded in the backbone features. Specifically, we consider object detection, instance boundary prediction, and instance discrimination.
Performance on instance-awareness tasks. Self-supervised methods outperform vision-language methods.
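The ARI column in the table above presumably refers to the adjusted Rand index, which scores how well a predicted grouping of pixels into instances matches the ground-truth grouping. A minimal sketch of the metric using scikit-learn (the labels below are toy values, not data from our experiments):

```python
# Toy sketch of the adjusted Rand index (ARI): it compares a predicted
# grouping of pixels into instances with the ground-truth grouping.
from sklearn.metrics import adjusted_rand_score

gt_instances   = [0, 0, 1, 1, 2, 2, 2]   # ground-truth instance id per pixel (toy values)
pred_instances = [1, 1, 0, 0, 2, 2, 0]   # predicted instance id per pixel (toy values)
print(adjusted_rand_score(gt_instances, pred_instances))   # 1.0 would be a perfect match
```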

Backbone property correlations

Using the data from above, we analyze to what degree performance is related to general properties of the backbone (image size, feature dimension, number of parameters), independent of the training method.
Correlation of task performance with backbone properties (p-values in brackets).
Semantic segmentation and depth estimation improve with larger backbone size (in terms of parameters) and higher input resolution, showing significant positive correlations. Semantic segmentation also correlates slightly with spatial feature size. Object detection shows weaker correlations with all three backbone variables, suggesting less sensitivity to these factors.
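As an illustration, such correlations with p-values can be computed with a rank correlation test; the sketch below uses SciPy's Spearman correlation on placeholder values (the correlation measure used in the paper and the actual numbers are not reproduced here):

```python
# Hypothetical sketch: correlate task performance with backbone properties.
# All numbers below are placeholders, not values from the table above.
import numpy as np
from scipy.stats import spearmanr

# Backbone properties (e.g. parameter count in millions, input resolution in pixels).
params = np.array([86, 304, 86, 632, 86])
resolution = np.array([224, 224, 518, 384, 512])
# Downstream scores (e.g. semantic segmentation IoU).
iou = np.array([61.0, 74.9, 83.4, 73.9, 75.7])

for name, prop in [("params", params), ("resolution", resolution)]:
    rho, p = spearmanr(prop, iou)          # rank correlation and its p-value
    print(f"{name}: rho={rho:.2f} (p={p:.3f})")
```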

Speed-accuracy trade-off

In practice, inference speed and memory requirements often matter. If these are constrained, a trade-off must be found. The following figure helps identify the right backbone depending on the task and the available inference compute.
Speed-accuracy trade-off in object detection, depth estimation and semantic segmentation.
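As a rough guide, backbone latency can be measured with a simple timing loop; the sketch below uses an arbitrary torchvision model and CPU timing as stand-ins, not the measurement setup behind the figure:

```python
# Minimal latency benchmark for a frozen backbone (any torchvision/timm model
# would do); batch size, input size, and iteration counts are assumptions.
import time
import torch
import torchvision.models as models

model = models.resnet50(weights=None).eval()
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(3):                       # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    latency = (time.perf_counter() - start) / 10

print(f"mean latency: {latency * 1000:.1f} ms per image")
```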

Method: Dense Attentive Probing

The idea of Dense Attentive Probing is to use cross-attention as a minimal task head on top of frozen backbones. Frozen means that the backbone's weights are not updated during training. Because the head has only a small number of parameters, we directly assess the quality of the backbone's features for the respective task.

General architecture: We feed the features from frozen backbones into the DEAP adapter, which consists of two cross-attention layers followed by deconvolutions.
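To make this concrete, here is a rough PyTorch sketch of such an adapter: learned queries cross-attend twice to the (projected) backbone tokens, and the resulting query grid is upsampled by deconvolutions. Layer widths, the query grid size, and other details are assumptions for illustration, not the exact configuration from the paper.

```python
# Rough sketch of a DEAP-style adapter (illustrative, not the exact architecture):
# query tokens cross-attend to frozen backbone features, then deconvolutions
# upsample the query grid into a dense prediction.
import torch
import torch.nn as nn

class DenseProbe(nn.Module):
    def __init__(self, feat_dim=768, dim=256, query_grid=16, out_channels=1):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(query_grid * query_grid, dim))
        self.proj = nn.Linear(feat_dim, dim)          # project backbone features
        self.attn1 = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.grid = query_grid
        self.deconv = nn.Sequential(                  # upsample 16x16 -> 64x64
            nn.ConvTranspose2d(dim, dim // 2, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(dim // 2, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, feats):                         # feats: (B, N, feat_dim) backbone tokens
        b = feats.size(0)
        kv = self.proj(feats)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q, _ = self.attn1(q, kv, kv)
        q, _ = self.attn2(q, kv, kv)
        q = q.transpose(1, 2).reshape(b, -1, self.grid, self.grid)   # (B, dim, 16, 16)
        return self.deconv(q)

probe = DenseProbe()
out = probe(torch.randn(2, 196, 768))                 # e.g. ViT-B/16 tokens at 224px
print(out.shape)                                      # torch.Size([2, 1, 64, 64])
```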

Our method generates outputs for different tasks using cross-attention layers that attend to the feature volume and whose queries are each responsible for generating a local region of the output. We enable the queries to focus on local regions of the feature volume (i.e., neighborhoods close to the query's location) by masking the attention.

The attention mask depends on the query location and a learnable parameter σ.
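Below is a hypothetical sketch of how such a distance-based mask could be implemented: queries and backbone features are placed on normalized 2D grids, and the attention logits receive an additive penalty that grows with the squared distance between query and feature location, with the width of the penalty controlled by a learnable σ. The exact functional form used in the paper may differ.

```python
# Sketch of a distance-based attention mask with a learnable size parameter.
# Queries and keys live on 2D grids; the mask penalizes attention to feature
# locations far from the query's location, with the radius set by sigma.
import torch
import torch.nn as nn

def grid_coords(h, w):
    """Normalized (x, y) coordinates of an h x w grid, shape (h*w, 2)."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    return torch.stack([xs, ys], dim=-1).reshape(-1, 2)

class DistanceMask(nn.Module):
    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.tensor(0.0))   # sigma = exp(log_sigma) > 0

    def forward(self, query_hw, key_hw):
        q_pos = grid_coords(*query_hw)                      # (Nq, 2)
        k_pos = grid_coords(*key_hw)                        # (Nk, 2)
        dist2 = ((q_pos[:, None] - k_pos[None]) ** 2).sum(-1)   # squared distances, (Nq, Nk)
        sigma = self.log_sigma.exp()
        return -dist2 / (2 * sigma ** 2)                    # additive bias for attention logits

mask = DistanceMask()
bias = mask((64, 64), (14, 14))                             # queries on 64x64, features on 14x14
print(bias.shape)                                           # torch.Size([4096, 196])
# `bias` can be passed as `attn_mask` to nn.MultiheadAttention
# (float masks are added to the attention logits before the softmax).
```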

Dynamic Output Size

It is possible to generate dense predictions at resolutions different from the training resolution. In some cases, more details become visible this way.
The output size can be changed after training.
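One way to realize this (a hypothetical sketch, not necessarily the paper's exact mechanism) is to derive the query tokens from output-pixel coordinates rather than from a fixed-size learned grid; any output resolution can then be requested at inference time simply by sampling a denser coordinate grid:

```python
# Hypothetical sketch of resolution-independent queries: query tokens are
# computed from output-pixel coordinates, so the output grid size can be
# chosen freely after training.
import torch
import torch.nn as nn

class CoordQueries(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, h, w):
        ys, xs = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
        coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (h*w, 2) pixel coordinates
        return self.mlp(coords)                                 # (h*w, dim) query tokens

queries = CoordQueries()
print(queries(64, 64).shape, queries(128, 128).shape)           # training vs. denser test grid
```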

Method Comparison

How about other methods for dense probing? We compare to three baselines: bicubic upscaling, a CNN readout, and a FeatUp-based upscaler.
Comparison of our method with a CNN baseline.
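For reference, the bicubic baseline is the simplest of the three: upsample the frozen backbone's feature map and apply a pointwise linear readout. A minimal sketch (channel counts, class count, and output size are illustrative assumptions):

```python
# Sketch of a bicubic-upscaling baseline: bicubically upsample the frozen
# backbone's feature map, then apply a 1x1 convolution as a linear readout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BicubicReadout(nn.Module):
    def __init__(self, feat_dim=768, num_classes=21, out_size=224):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.out_size = out_size

    def forward(self, feats):                      # feats: (B, C, H, W) from the backbone
        feats = F.interpolate(feats, size=(self.out_size, self.out_size),
                              mode="bicubic", align_corners=False)
        return self.head(feats)

readout = BicubicReadout()
print(readout(torch.randn(1, 768, 14, 14)).shape)  # torch.Size([1, 21, 224, 224])
```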
For more information, see our Paper and the corresponding source code.

Citation

@article{lueddecke2025deap,
  title={Characterizing Vision Backbones for Dense Prediction with Dense Attentive Probing},
  author={Timo Lüddecke and Alexander Ecker},
  journal={Transactions on Machine Learning Research},
  year={2025},
  url={https://openreview.net/forum?id=neMAx4uBlh},
  note={accepted}
}

Acknowledgements

This publication was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project-ID 454648639 - SFB 1528