INSID3: Training-Free In-Context Segmentation with DINOv3

✨ CVPR 2026 ✨
1Politecnico di Torino, 2TU Darmstadt, 3TU Munich, 4hessian.AI, 5ELIZA, 6MCML

INSID3 performs in-context segmentation directly from a single frozen DINOv3 backbone, without fine-tuning, segmentation decoders, or auxiliary models.

Annotate one or a few examples of a reference category, and INSID3 segments that category in new images.

Highlights

Single frozen backbone. INSID3 performs in-context segmentation directly from DINOv3 features, without segmentation decoders, fine-tuning, or auxiliary models.

Emergent segmentation. DINOv3 dense features naturally form coherent object- and part-level groups, enabling structured region decomposition of the target image.

Positional bias. We identify a positional bias in DINOv3 features and remove it with a simple training-free projection, disentangling position from semantics to improve correspondences.

Across domains and granularities. INSID3 handles object-level, part-level, and personalized segmentation across natural, medical, underwater, and aerial domains with a unified training-free solution.

Strong Generalization Across Granularities and Domains

Figure: performance radar plot.

INSID3 generalizes across object-level, part-level, and personalized segmentation, and across diverse domains including natural, medical, underwater, and aerial imagery.

Fastest among compared methods

Measured in frames per second, INSID3 runs substantially faster than DINOv2 + SAM-based pipelines.

Method          Backbone       Speed (FPS)
Matcher         DINOv2 + SAM   0.11
GF-SAM          DINOv2 + SAM   0.97
INSID3 (ours)   DINOv3         3.31

Higher is better. INSID3 runs about 3.4× faster than GF-SAM and about 29.8× faster than Matcher.


Region-level Grouping from DINOv3

Dense DINOv3 features naturally induce a structured decomposition of the scene. By clustering them, we obtain coherent object- and part-level regions without supervision, directly enabling segmentation in feature space.

Figure: six input images alongside their clustered DINOv3 feature maps.
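
To make this grouping step concrete, here is a minimal sketch that clusters dense patch features with k-means, assuming the features have already been extracted from the frozen backbone; the cluster count and the use of k-means are illustrative, not necessarily the paper's exact procedure.

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def cluster_patch_features(feats: torch.Tensor, grid_h: int, grid_w: int,
                           n_clusters: int = 8) -> torch.Tensor:
    """Group dense patch features of shape (grid_h * grid_w, C) into regions.

    Returns an integer label map of shape (grid_h, grid_w); upsampling it
    to the image resolution yields region maps like the ones shown above.
    """
    # L2-normalize so Euclidean k-means approximates cosine-based grouping.
    feats = torch.nn.functional.normalize(feats, dim=-1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        feats.detach().cpu().numpy())
    return torch.from_numpy(labels).view(grid_h, grid_w)
```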

Uncovering and Removing Positional Bias

Figure panels: support (reference) image, query (target) image, similarity map with DINOv3, and similarity map with debiased DINOv3.

Given a reference embedding on the baseball bat and a target image, the resulting similarity map should ideally highlight the corresponding bat region. However, the map also responds to absolute image position: besides the baseball bat, it activates in the top-left region, mirroring the bat's location in the reference image.
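
The similarity map itself is simple to reproduce. A minimal sketch, assuming a single reference embedding pooled from the annotated support region and dense query features arranged in an (H, W, C) grid:

```python
import torch

@torch.no_grad()
def similarity_map(ref_embedding: torch.Tensor,
                   query_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a reference embedding (C,) and dense
    query features (H, W, C); returns an (H, W) heat map."""
    ref = torch.nn.functional.normalize(ref_embedding, dim=-1)
    q = torch.nn.functional.normalize(query_feats, dim=-1)
    return torch.einsum("hwc,c->hw", q, ref)
```

With raw DINOv3 features this map shows the spurious top-left activation described above; with debiased features it concentrates on the bat.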

A low-dimensional positional subspace in DINOv3 features
Figure: visualizations of the first 18 PCA components of DINOv3 patch features on low-semantic-content images.

PCA on low-semantic-content images reveals that this effect lives in a stable low-dimensional subspace. INSID3 removes it in a simple training-free way: we identify the positional component of DINOv3 features and project onto its orthogonal complement. This suppresses coordinate-driven responses while preserving semantics.
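
A minimal sketch of this projection, assuming the positional basis is estimated once by PCA over patch features from a few low-semantic-content images; the number of components k and the choice of calibration images are assumptions, so see the paper for the exact recipe.

```python
import torch

@torch.no_grad()
def positional_basis(calib_feats: torch.Tensor, k: int = 8) -> torch.Tensor:
    """PCA over (N, C) patch features from low-semantic-content images.

    Returns an orthonormal (C, k) basis spanning the positional subspace.
    """
    centered = calib_feats - calib_feats.mean(dim=0, keepdim=True)
    # Right singular vectors of the centered features are the PCA directions.
    _, _, v = torch.pca_lowrank(centered, q=k, center=False)
    return v

@torch.no_grad()
def debias(feats: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project features onto the orthogonal complement of the positional
    subspace: f' = f - U (U^T f), suppressing coordinate-driven responses."""
    return feats - (feats @ basis) @ basis.T
```

The same basis is applied to both support and query features before computing correspondences, so matches are compared on semantics rather than position.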

And what about DINOv2?
Figure: positional-bias comparison between DINOv2 and DINOv3.

This positional bias is much weaker in DINOv2. This suggests that DINOv3 encodes stronger positional information, likely as a by-product of its training objective: local-consistency constraints and Gram anchoring may amplify absolute spatial correlations when semantic cues are weak, and its different positional encoding may further reinforce this effect.

Contact

For questions, contact claudia.cuttano@polito.it

BibTeX

@inproceedings{cuttano2026insid3,
  title     = {{INSID3}: Training-Free In-Context Segmentation with {DINOv3}},
  author    = {Claudia Cuttano and Gabriele Trivigno and Christoph Reich and Daniel Cremers and Carlo Masone and Stefan Roth},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgments. Claudia Cuttano was supported by the Sustainable Mobility Center (CNMS), which received funding from the European Union Next Generation EU (Piano Nazionale di Ripresa e Resilienza (PNRR), Missione 4 Componente 2 Investimento 1.4 “Potenziamento strutture di ricerca e creazione di campioni nazionali di R&S su alcune Key Enabling Technologies”) with grant agreement no. CN_00000023. Christoph Reich is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the German Federal Ministry of Education and Research. Stefan Roth has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 866008). Further, he was supported by the DFG under Germany's Excellence Strategy (EXC-3057/1 “Reasonable Artificial Intelligence”, Project No. 533677015) and by the LOEWE initiative (Hesse, Germany) within the emergenCITY center [LOEWE/1/12/519/03/05.001(0016)/72]. Daniel Cremers has received funding from the European Research Council (ERC) Advanced Grant SIMULACRON (grant agreement No. 884679). We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources. We also acknowledge the support of the European Laboratory for Learning and Intelligent Systems (ELLIS). Finally, we thank Barış Zöngür for insightful feedback.