INSID3: Training-Free In-Context Segmentation with DINOv3

✨ CVPR 2026 ✨
1Politecnico di Torino, 2TU Darmstadt, 3TU Munich, 4hessian.AI, 5ELIZA, 6MCML

INSID3 performs in-context segmentation directly from a single frozen DINOv3 backbone, without fine-tuning, segmentation decoders, or auxiliary models.

Annotate one or a few examples of a reference category, and INSID3 segments that category in new images.

Highlights

Single frozen backbone. INSID3 performs in-context segmentation directly from DINOv3 features, without segmentation decoders, fine-tuning, or auxiliary models.

Emergent segmentation. DINOv3 dense features naturally form coherent object- and part-level groups, enabling structured region decomposition of the target image.

Positional bias. We identify a positional bias in DINOv3 features and remove it with a simple training-free projection, disentangling position from semantics to improve correspondences.

Across domains and granularities. INSID3 handles object-level, part-level, and personalized segmentation across natural, medical, underwater, and aerial domains with a unified training-free solution.

Strong Generalization Across Granularities and Domains

Figure: performance radar plot.

INSID3 generalizes across object-level, part-level, and personalized segmentation, and across diverse domains including natural, medical, underwater, and aerial imagery.

Fastest among compared methods

Measured in frames per second, INSID3 runs substantially faster than DINOv2 + SAM-based pipelines.

Method          Backbone       Speed (FPS)
Matcher         DINOv2 + SAM   0.11
GF-SAM          DINOv2 + SAM   0.97
INSID3 (ours)   DINOv3         3.31

Higher is better. INSID3 runs about 3.4× faster than GF-SAM and about 29.8× faster than Matcher.


Region-level Grouping from DINOv3

Dense DINOv3 features naturally induce a structured decomposition of the scene. By clustering them, we obtain coherent object- and part-level regions without supervision, directly enabling segmentation in feature space.

Figure: six input images alongside their clustered DINOv3 feature maps.
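
To make this grouping step concrete, here is a minimal sketch that clusters dense patch features with k-means, assuming the features have already been extracted from the frozen backbone; the cluster count and the use of k-means are illustrative, not necessarily the paper's exact procedure.

```python
import torch
from sklearn.cluster import KMeans

@torch.no_grad()
def cluster_patch_features(feats: torch.Tensor, grid_h: int, grid_w: int,
                           n_clusters: int = 8) -> torch.Tensor:
    """Group dense patch features of shape (grid_h * grid_w, C) into regions.

    Returns an integer label map of shape (grid_h, grid_w); upsampling it
    to the image resolution yields region maps like the ones shown above.
    """
    # L2-normalize so Euclidean k-means approximates cosine-based grouping.
    feats = torch.nn.functional.normalize(feats, dim=-1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        feats.detach().cpu().numpy())
    return torch.from_numpy(labels).view(grid_h, grid_w)
```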

Uncovering and Removing Positional Bias

Figure panels: support (reference) image, query (target) image, similarity map with DINOv3, and similarity map with debiased DINOv3.

Given a reference embedding on the baseball bat and a target image, the resulting similarity map should ideally highlight the corresponding bat region. However, the map also responds to absolute image position: besides the baseball bat, it activates in the top-left region, mirroring the bat's location in the reference image.
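
The similarity map itself is simple to reproduce. A minimal sketch, assuming a single reference embedding pooled from the annotated support region and dense query features arranged in an (H, W, C) grid:

```python
import torch

@torch.no_grad()
def similarity_map(ref_embedding: torch.Tensor,
                   query_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a reference embedding (C,) and dense
    query features (H, W, C); returns an (H, W) heat map."""
    ref = torch.nn.functional.normalize(ref_embedding, dim=-1)
    q = torch.nn.functional.normalize(query_feats, dim=-1)
    return torch.einsum("hwc,c->hw", q, ref)
```

With raw DINOv3 features this map shows the spurious top-left activation described above; with debiased features it concentrates on the bat.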

A low-dimensional positional subspace in DINOv3 features
Figure: visualizations of the first 18 PCA components of DINOv3 patch features on low-semantic-content images.

PCA on low-semantic-content images reveals that this effect lives in a stable low-dimensional subspace. INSID3 removes it in a simple training-free way: we identify the positional component of DINOv3 features and project onto its orthogonal complement. This suppresses coordinate-driven responses while preserving semantics.
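
A minimal sketch of this projection, assuming the positional basis is estimated once by PCA over patch features from a few low-semantic-content images; the number of components k and the choice of calibration images are assumptions, so see the paper for the exact recipe.

```python
import torch

@torch.no_grad()
def positional_basis(calib_feats: torch.Tensor, k: int = 8) -> torch.Tensor:
    """PCA over (N, C) patch features from low-semantic-content images.

    Returns an orthonormal (C, k) basis spanning the positional subspace.
    """
    centered = calib_feats - calib_feats.mean(dim=0, keepdim=True)
    # Right singular vectors of the centered features are the PCA directions.
    _, _, v = torch.pca_lowrank(centered, q=k, center=False)
    return v

@torch.no_grad()
def debias(feats: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Project features onto the orthogonal complement of the positional
    subspace: f' = f - U (U^T f), suppressing coordinate-driven responses."""
    return feats - (feats @ basis) @ basis.T
```

The same basis is applied to both support and query features before computing correspondences, so matches are compared on semantics rather than position.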

And what about DINOv2?
Figure: positional-bias comparison between DINOv2 and DINOv3.

This positional bias is much weaker in DINOv2. This suggests that DINOv3 encodes stronger positional information, likely as a by-product of its training objective: local-consistency constraints and Gram anchoring may amplify absolute spatial correlations when semantic cues are weak, and its different positional encoding may further reinforce this effect.

Contact

For questions, contact claudia.cuttano@polito.it

BibTeX

@inproceedings{cuttano2026insid3,
  title     = {{INSID3}: Training-Free In-Context Segmentation with {DINOv3}},
  author    = {Claudia Cuttano and Gabriele Trivigno and Christoph Reich and Daniel Cremers and Carlo Masone and Stefan Roth},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgments. Claudia Cuttano was supported by the Sustainable Mobility Center (CNMS), which received funding from the European Union Next Generation EU (Piano Nazionale di Ripresa e Resilienza (PNRR), Missione 4 Componente 2 Investimento 1.4 “Potenziamento strutture di ricerca e creazione di campioni nazionali di R&S su alcune Key Enabling Technologies”) with grant agreement no. CN_00000023. Christoph Reich is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the German Federal Ministry of Education and Research. Stefan Roth has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 866008). Further, he was supported by the DFG under Germany's Excellence Strategy (EXC-3057/1 “Reasonable Artificial Intelligence”, Project No. 533677015) and by the LOEWE initiative (Hesse, Germany) within the emergenCITY center [LOEWE/1/12/519/03/05.001(0016)/72]. Daniel Cremers has received funding from the European Research Council (ERC) Advanced Grant SIMULACRON (grant agreement No. 884679). We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources. We also acknowledge the support of the European Laboratory for Learning and Intelligent Systems (ELLIS). Finally, we thank Barış Zöngür for insightful feedback.