MARCO learns generalizable semantic correspondence from sparse supervision, going beyond fixed keypoint vocabularies toward dense matches that transfer to unseen keypoints and novel categories.
Built on a single DINOv2 backbone, it achieves state-of-the-art correspondence while remaining 3× smaller and 10× faster than diffusion-based approaches.
Raw DINOv2 features already contain meaningful correspondence cues across an object. Standard fine-tuning on sparse landmarks improves semantics at the annotated keypoints, but causes the correspondence field to collapse around them. MARCO instead propagates supervision across the object surface, producing smoother and more geometrically consistent flow.
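To make this concrete, correspondences can be read off frozen patch features by nearest-neighbor matching under cosine similarity. The sketch below is illustrative only: it substitutes random arrays for actual DINOv2 patch tokens, and is not the MARCO training procedure.

```python
import numpy as np

def nearest_neighbor_matches(feat_src, feat_tgt):
    """Match each source patch to its most similar target patch
    by cosine similarity. feat_*: (N, D) patch feature arrays."""
    src = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    sim = src @ tgt.T          # (N_src, N_tgt) cosine similarities
    return sim.argmax(axis=1)  # best target patch index per source patch

# Stand-in for DINOv2 patch tokens (e.g. a 16x16 grid of 384-d features).
rng = np.random.default_rng(0)
f_src = rng.standard_normal((256, 384))
f_tgt = rng.standard_normal((256, 384))
matches = nearest_neighbor_matches(f_src, f_tgt)
```

Turning these per-patch matches into a displacement field (matched position minus source position) yields the flow visualized above.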
We consider a source image and a target image. From this pair, we visualize the source-to-target correspondence flow in HSV space, where color encodes the displacement field.
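A common convention for such flow visualizations, sketched below under the assumption that hue encodes flow direction and value encodes normalized magnitude (the exact color coding used on this page may differ):

```python
import numpy as np

def flow_to_hsv(flow):
    """Encode a (H, W, 2) displacement field as an (H, W, 3) HSV image:
    hue <- flow direction, saturation fixed at 1, value <- magnitude."""
    dx, dy = flow[..., 0], flow[..., 1]
    hue = (np.arctan2(dy, dx) + np.pi) / (2 * np.pi)  # direction in [0, 1)
    mag = np.hypot(dx, dy)
    val = mag / (mag.max() + 1e-8)                    # magnitude in [0, 1]
    return np.stack([hue, np.ones_like(hue), val], axis=-1)

# Toy flow: uniform rightward shift of 5 pixels over a 32x32 grid,
# which maps to a single constant color.
flow = np.zeros((32, 32, 2))
flow[..., 0] = 5.0
hsv = flow_to_hsv(flow)
```

The HSV array can then be converted to RGB for display, e.g. with `matplotlib.colors.hsv_to_rgb`.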
This is the key observation behind MARCO: instead of discarding that structure, we use it as a source of self-supervision and to progressively densify correspondences during training.
We build a new evaluation protocol on MP-100 spanning unseen-keypoint and unseen-category settings, for a total of 350 unseen keypoint definitions and 62 unseen categories. This protocol explicitly tests transfer to new landmark vocabularies and entirely new object categories beyond the training distribution.
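Keypoint localization in this setting is typically scored with PCK (Percentage of Correct Keypoints): a prediction counts as correct if it falls within a threshold proportional to an object-scale reference. The sketch below shows the generic metric; the reference scale and thresholds are assumptions, not necessarily the exact MP-100 protocol.

```python
import numpy as np

def pck(pred, gt, ref_size, alpha=0.1):
    """Percentage of Correct Keypoints. A prediction is correct if it
    lies within alpha * ref_size of the ground truth.
    pred, gt: (N, 2) arrays of (x, y); ref_size: (N,) reference scale
    (e.g. bounding-box side length)."""
    dist = np.linalg.norm(pred - gt, axis=1)
    return float((dist <= alpha * ref_size).mean())

pred = np.array([[10.0, 10.0], [50.0, 50.0]])
gt   = np.array([[12.0, 10.0], [80.0, 50.0]])
size = np.array([100.0, 100.0])
score = pck(pred, gt, size, alpha=0.1)  # -> 0.5: only the first keypoint
                                        # is within 10 px of ground truth
```

Smaller values of `alpha` correspond to the stricter localization thresholds reported in the results.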
Download images and annotations following our instructions.
Unseen-keypoint split. This domain extends the SPair-71k person category from only 7 coarse facial keypoints to a dense 68-landmark facial annotation scheme.
Categories: 1 · Avg. keypoints per pair: 68 · Keypoint definitions: 68.
MARCO sets a new state of the art at strict localization thresholds, on both unseen keypoints and unseen categories, while running 10× faster than diffusion-based approaches.
For questions, contact claudia.cuttano@polito.it
@inproceedings{cuttano2026marco,
  title     = {{MARCO}: Navigating the Unseen Space of Semantic Correspondence},
  author    = {Claudia Cuttano and Gabriele Trivigno and Carlo Masone and Stefan Roth},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
Acknowledgments. Claudia Cuttano was supported by the Sustainable Mobility Center (CNMS), which received funding from the European Union Next Generation EU (Piano Nazionale di Ripresa e Resilienza (PNRR), Missione 4 Componente 2 Investimento 1.4 “Potenziamento strutture di ricerca e creazione di campioni nazionali di R&S su alcune Key Enabling Technologies”) with grant agreement no. CN_00000023. Stefan Roth has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 866008). Further, he was supported by the DFG under Germany's Excellence Strategy (EXC-3057/1 “Reasonable Artificial Intelligence”, Project No. 533677015). We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources. We acknowledge the support of the European Laboratory for Learning and Intelligent Systems (ELLIS).