MARCO: Navigating the Unseen Space of Semantic Correspondence

✨ CVPR 2026 Oral ✨
¹Politecnico di Torino, ²TU Darmstadt, ³hessian.AI, ⁴ELIZA

MARCO learns generalizable semantic correspondence from sparse supervision, going beyond fixed keypoint vocabularies toward dense matches that transfer to unseen keypoints and novel categories.

Built on a single DINOv2 backbone, it achieves state-of-the-art correspondence while remaining 3× smaller and 10× faster than diffusion-based approaches.

Training densification

Starting from a sparse set of ground-truth keypoints, MARCO progressively builds dense supervision across the object surface: it identifies reliable matches, expands them via Delaunay triangulation, and propagates them with dense flow, turning a handful of annotations into thousands of correspondences at training time.
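One way to picture the triangulation step is barycentric transfer: once matched keypoints form a triangle in the source and a corresponding triangle in the target, any point inside the source triangle can be mapped to the target with the same barycentric weights. The sketch below is only an illustration under that assumption. The page does not spell out MARCO's exact expansion rule, and the function names `barycentric_coords` and `transfer_point` are hypothetical.

```python
def barycentric_coords(p, a, b, c):
    """Return the barycentric weights (wa, wb, wc) of point p in triangle abc."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if abs(det) < 1e-12:
        raise ValueError("degenerate triangle")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


def transfer_point(p, src_tri, tgt_tri, eps=1e-9):
    """Map p from the source triangle to the target triangle.

    Returns None when p lies outside the source triangle, so that only
    interior points produce pseudo-correspondences (an assumption of this sketch).
    """
    wa, wb, wc = barycentric_coords(p, *src_tri)
    if min(wa, wb, wc) < -eps:
        return None
    (ax, ay), (bx, by), (cx, cy) = tgt_tri
    return (wa * ax + wb * bx + wc * cx, wa * ay + wb * by + wc * cy)


# Hypothetical matched keypoints: three in the source, three in the target.
src_tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
tgt_tri = [(10.0, 10.0), (18.0, 10.0), (10.0, 18.0)]
inside = transfer_point((1.0, 1.0), src_tri, tgt_tri)
outside = transfer_point((5.0, 5.0), src_tri, tgt_tri)
```

Applied to every pixel inside every triangle of a Delaunay mesh over the matched keypoints, this kind of interpolation turns three annotated points into a dense patch of pseudo-correspondences.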

Applications

MARCO enables downstream applications such as semantic edit transfer, where an edit made on one object is propagated to another through correspondence, as well as semantic warping between instances, by mapping semantically corresponding regions from one object to the other and back.

Highlights

Generalization
Seen keypoints, unseen keypoints, new categories
MARCO is state of the art not only on standard benchmarks, but also when the queried point is new or the object category was never seen during training.
Dense supervision
10–20 keypoints become thousands of correspondences
Instead of learning only sparse landmarks, MARCO expands supervision across the whole object surface, building a richer training signal.
Precision
Big gains at strict thresholds
We propose a new training schedule that improves localization accuracy: predictions land not just in the correct region, but closer to the exact point location.
Efficiency
3× smaller, 10× faster
A single DINOv2 backbone replaces heavier diffusion-based dual-encoder pipelines while also reaching state-of-the-art performance.
Benchmark
A new testbed for true generalization
We introduce the MP-100 benchmark that explicitly evaluates transfer to novel keypoints and novel categories, beyond classical in-domain benchmarks.
Why densification matters

From Sparse Landmarks to Dense Pseudo-Correspondence

Raw DINOv2 features already contain meaningful correspondence cues across an object. Standard fine-tuning on sparse landmarks improves semantics around the annotated keypoints, but makes the field collapse around them. MARCO propagates supervision across the surface, producing smoother and more geometrically consistent flow.

[Figure: source image and target image.]

We consider a source image and a target image. From this pair, we visualize the source-to-target correspondence flow in HSV space, where color encodes the displacement field.
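A common way to render a displacement field in HSV, sketched below, maps the direction of each displacement to hue and its magnitude to saturation. The exact color convention used in the figures is not stated here, so the mapping and the helper name `flow_to_rgb` are assumptions for illustration.

```python
import colorsys
import math


def flow_to_rgb(dx, dy, max_magnitude):
    """Convert one displacement vector to an 8-bit RGB color.

    Hue encodes direction, saturation encodes magnitude relative to
    max_magnitude, and value is fixed at 1 (assumed convention).
    """
    angle = math.atan2(dy, dx)                        # in [-pi, pi]
    hue = ((angle + math.pi) / (2 * math.pi)) % 1.0   # in [0, 1)
    magnitude = math.hypot(dx, dy)
    saturation = min(magnitude / max_magnitude, 1.0) if max_magnitude > 0 else 0.0
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))


def flow_field_to_rgb(flow):
    """Colorize a 2D grid (list of rows) of (dx, dy) displacements."""
    max_mag = max(math.hypot(dx, dy) for row in flow for dx, dy in row)
    return [[flow_to_rgb(dx, dy, max_mag) for dx, dy in row] for row in flow]


# Tiny hypothetical 2x2 flow field.
flow = [[(0.0, 0.0), (1.0, 0.0)],
        [(0.0, 1.0), (-1.0, -1.0)]]
image = flow_field_to_rgb(flow)
```

Under this convention a smooth, coherent flow shows up as slowly varying color, while a collapsed or noisy field shows abrupt color changes.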

1. Flow consistency in DINOv2

Raw DINOv2 already produces a partially coherent correspondence field.

This is the key observation behind MARCO: instead of discarding that structure, we use it as a source of self-supervision and to progressively densify correspondences during training.
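To make the idea of reading correspondences off frozen features concrete, the sketch below matches feature vectors by cosine similarity and keeps only mutual nearest neighbors. The mutual check is a common reliability filter and is used here as an assumption. It is not necessarily the criterion MARCO uses, and the toy 2-D vectors stand in for real DINOv2 patch features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest(query, candidates):
    """Index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


def mutual_nearest_neighbors(src_feats, tgt_feats):
    """Return (src_index, tgt_index) pairs that pick each other as best match."""
    matches = []
    for i, feat in enumerate(src_feats):
        j = nearest(feat, tgt_feats)
        if nearest(tgt_feats[j], src_feats) == i:
            matches.append((i, j))
    return matches


# Toy features (hypothetical): three source patches, two target patches.
src_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_feats = [[0.0, 2.0], [3.0, 0.1]]
matches = mutual_nearest_neighbors(src_feats, tgt_feats)
```

In this toy case the third source patch is dropped because its best target already prefers another source patch, which is the kind of filtering that separates reliable matches from ambiguous ones.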

Benchmark

A Benchmark for Generalization Beyond Seen Keypoints

We build a new evaluation protocol on MP-100 spanning unseen-keypoint and unseen-category splits, for a total of 350 unseen keypoint definitions and 62 unseen categories. This setting explicitly tests transfer to new landmark vocabularies and entirely new object categories beyond the training distribution.

Download images and annotations following our instructions.

[Figure: benchmark samples for unseen keypoints and unseen categories, covering human face, apparel item, animal face, animal body, and home furniture.]
MP-100 benchmark

Human face

Unseen-keypoint split. This domain extends the SPair-71k person category from only 7 coarse facial keypoints to a dense 68-landmark facial annotation scheme.

Categories: 1 · Avg. keypoints per pair: 68 · Keypoint definitions: 68.

Performance

Precise, Generalizable, and Fast

MARCO sets a new state of the art at strict localization thresholds, on unseen keypoints and unseen categories, while running about 10× faster than diffusion-based approaches.

In-domain precision
State of the art on keypoints seen during training, with more precise localization.
Strict thresholds measure whether the match lands near the exact point, not just the right region.
SPair-71k · PCK@0.01
MARCO: 27.0 · Geo-SC: 21.7 · Jamais Vu: 20.5
AP-10K cross-family · PCK@0.01
MARCO: 28.5 · Geo-SC: 18.3 · Jamais Vu: 18.4
+5.3 on SPair-71k · +10.2 avg on AP-10K vs Geo-SC
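For readers unfamiliar with the metric, PCK@α counts a predicted keypoint as correct when it falls within α times a reference size of the ground truth. The sketch below assumes the widespread convention of normalizing by the longer side of the object bounding box. The normalization used for each benchmark above is not stated on this page, so treat that choice and the function name `pck` as assumptions.

```python
import math


def pck(pred, gt, bbox_size, alpha):
    """Percentage of Correct Keypoints at threshold alpha.

    pred, gt: lists of (x, y) points in the target image.
    bbox_size: (width, height) of the object bounding box (assumed reference).
    """
    threshold = alpha * max(bbox_size)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return 100.0 * correct / len(gt)


# Hypothetical predictions with errors of 1, 1.5 and 10 pixels.
pred = [(10.0, 10.0), (50.0, 51.5), (90.0, 100.0)]
gt = [(10.0, 11.0), (50.0, 50.0), (90.0, 90.0)]
bbox = (200.0, 100.0)

strict = pck(pred, gt, bbox, alpha=0.01)   # threshold of 2 pixels
loose = pck(pred, gt, bbox, alpha=0.10)    # threshold of 20 pixels
```

The same predictions score very differently at the two thresholds, which is why strict settings such as PCK@0.01 separate precise localization from landing in roughly the right region.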
Out-of-domain
Strong not only on seen keypoints, but also beyond the training vocabulary.
Previous methods look strong in-domain, then drop once the queried semantics or categories change.
SPair-U · unseen keypoints · PCK@0.10
MARCO: 67.5 · Jamais Vu: 62.4 · Geo-SC: 56.9
MP-100 avg · 5 splits · PCK@0.10
MARCO: 59.7 · Jamais Vu: 54.2 · Geo-SC: 53.2
+5.1 on SPair-U · +5.6 avg on MP-100 vs Jamais Vu
Efficiency
A single DINOv2-based model, without the need for diffusion features.
Inference speed
8.3 FPS
MARCO runs about 10× faster than Geo-SC and Jamais Vu, both at 0.85 FPS.
Model size
323M
Compared with 950M parameters for Geo-SC and Jamais Vu.

Contact

For questions, contact claudia.cuttano@polito.it

BibTeX

@inproceedings{cuttano2026marco,
  title     = {{MARCO}: Navigating the Unseen Space of Semantic Correspondence},
  author    = {Claudia Cuttano and Gabriele Trivigno and Carlo Masone and Stefan Roth},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgments. Claudia Cuttano was supported by the Sustainable Mobility Center (CNMS), which received funding from the European Union Next Generation EU (Piano Nazionale di Ripresa e Resilienza (PNRR), Missione 4 Componente 2 Investimento 1.4 “Potenziamento strutture di ricerca e creazione di campioni nazionali di R&S su alcune Key Enabling Technologies”) with grant agreement no. CN_00000023. Stefan Roth has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 866008). Further, he was supported by the DFG under Germany's Excellence Strategy (EXC-3057/1 “Reasonable Artificial Intelligence”, Project No. 533677015). We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources. We acknowledge the support of the European Laboratory for Learning and Intelligent Systems (ELLIS).