MARCO learns generalizable semantic correspondence from sparse supervision, going beyond fixed keypoint vocabularies toward dense matches that transfer to unseen keypoints and novel categories.
Built on a single DINOv2 backbone, it achieves state-of-the-art correspondence while remaining 3× smaller and 10× faster than diffusion-based approaches.
Raw DINOv2 features already contain meaningful correspondence cues across an object. Standard fine-tuning on sparse landmarks improves semantics at the annotated keypoints, but causes the correspondence field to collapse around them. MARCO instead propagates supervision across the object surface, producing smoother and more geometrically consistent flow.
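To make this concrete, correspondences can be read off frozen patch features by nearest-neighbor matching under cosine similarity. The sketch below is illustrative only: it substitutes random arrays for actual DINOv2 patch tokens, and is not the MARCO training procedure.

```python
import numpy as np

def nearest_neighbor_matches(feat_src, feat_tgt):
    """Match each source patch to its most similar target patch
    by cosine similarity. feat_*: (N, D) patch feature arrays."""
    src = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    sim = src @ tgt.T          # (N_src, N_tgt) cosine similarities
    return sim.argmax(axis=1)  # best target patch index per source patch

# Stand-in for DINOv2 patch tokens (e.g. a 16x16 grid of 384-d features).
rng = np.random.default_rng(0)
f_src = rng.standard_normal((256, 384))
f_tgt = rng.standard_normal((256, 384))
matches = nearest_neighbor_matches(f_src, f_tgt)
```

Turning these per-patch matches into a displacement field (matched position minus source position) yields the flow visualized above.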
We consider a source image and a target image. From this pair, we visualize the source-to-target correspondence flow in HSV space, where color encodes the displacement field.
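A common convention for such flow visualizations, sketched below under the assumption that hue encodes flow direction and value encodes normalized magnitude (the exact color coding used on this page may differ):

```python
import numpy as np

def flow_to_hsv(flow):
    """Encode a (H, W, 2) displacement field as an (H, W, 3) HSV image:
    hue <- flow direction, saturation fixed at 1, value <- magnitude."""
    dx, dy = flow[..., 0], flow[..., 1]
    hue = (np.arctan2(dy, dx) + np.pi) / (2 * np.pi)  # direction in [0, 1)
    mag = np.hypot(dx, dy)
    val = mag / (mag.max() + 1e-8)                    # magnitude in [0, 1]
    return np.stack([hue, np.ones_like(hue), val], axis=-1)

# Toy flow: uniform rightward shift of 5 pixels over a 32x32 grid,
# which maps to a single constant color.
flow = np.zeros((32, 32, 2))
flow[..., 0] = 5.0
hsv = flow_to_hsv(flow)
```

The HSV array can then be converted to RGB for display, e.g. with `matplotlib.colors.hsv_to_rgb`.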
This is the key observation behind MARCO: instead of discarding that structure, we use it as a source of self-supervision and to progressively densify correspondences during training.
We build a new evaluation protocol on MP-100 spanning unseen-keypoint and unseen-category settings, for a total of 350 unseen keypoint definitions and 62 unseen categories. This protocol explicitly tests transfer to new landmark vocabularies and entirely new object categories beyond the training distribution.
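Keypoint localization in this setting is typically scored with PCK (Percentage of Correct Keypoints): a prediction counts as correct if it falls within a threshold proportional to an object-scale reference. The sketch below shows the generic metric; the reference scale and thresholds are assumptions, not necessarily the exact MP-100 protocol.

```python
import numpy as np

def pck(pred, gt, ref_size, alpha=0.1):
    """Percentage of Correct Keypoints. A prediction is correct if it
    lies within alpha * ref_size of the ground truth.
    pred, gt: (N, 2) arrays of (x, y); ref_size: (N,) reference scale
    (e.g. bounding-box side length)."""
    dist = np.linalg.norm(pred - gt, axis=1)
    return float((dist <= alpha * ref_size).mean())

pred = np.array([[10.0, 10.0], [50.0, 50.0]])
gt   = np.array([[12.0, 10.0], [80.0, 50.0]])
size = np.array([100.0, 100.0])
score = pck(pred, gt, size, alpha=0.1)  # -> 0.5: only the first keypoint
                                        # is within 10 px of ground truth
```

Smaller values of `alpha` correspond to the stricter localization thresholds reported in the results.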
Download images and annotations following our instructions.
Unseen-keypoint split. This domain extends the SPair-71k person category from only 7 coarse facial keypoints to a dense 68-landmark facial annotation scheme.
Categories: 1 · Avg. keypoints per pair: 68 · Keypoint definitions: 68.
MARCO sets a new state of the art at strict localization thresholds, on both unseen keypoints and unseen categories, while running 10× faster than diffusion-based approaches.
For questions, contact claudia.cuttano@polito.it
@inproceedings{cuttano2026marco,
  title     = {{MARCO}: Navigating the Unseen Space of Semantic Correspondence},
  author    = {Claudia Cuttano and Gabriele Trivigno and Carlo Masone and Stefan Roth},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
Acknowledgments. Claudia Cuttano was supported by the Sustainable Mobility Center (CNMS), which received funding from the European Union Next Generation EU (Piano Nazionale di Ripresa e Resilienza (PNRR), Missione 4 Componente 2 Investimento 1.4 “Potenziamento strutture di ricerca e creazione di campioni nazionali di R&S su alcune Key Enabling Technologies”) with grant agreement no. CN_00000023. Stefan Roth has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 866008). Further, he was supported by the DFG under Germany's Excellence Strategy (EXC-3057/1 “Reasonable Artificial Intelligence”, Project No. 533677015). We acknowledge the CINECA award under the ISCRA initiative, for the availability of high-performance computing resources. We acknowledge the support of the European Laboratory for Learning and Intelligent Systems (ELLIS).