Scene-Centric Unsupervised Video Panoptic Segmentation

¹TU Munich    ²TU Darmstadt    ³NVIDIA    ⁴University of Oxford    ⁵MCML    ⁶ELIZA    ⁷hessian.AI
*equal contribution    equal supervision

CVPR 2026


TL;DR: We introduce the task of unsupervised video panoptic segmentation, along with a unified evaluation protocol and four competitive baselines. We propose VideoCUPS, the first method that directly tackles it, generating temporally consistent panoptic pseudo-labels from monocular scene-centric videos using depth, motion, and visual cues, then training an accurate VPS model with a novel Video DropLoss.


Abstract

Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, which omits all human supervision. Existing work on unsupervised scene understanding has mainly focused on image segmentation; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.


Method

[Figure: VideoCUPS pseudo-label generation pipeline]

VideoCUPS generates temporally consistent video panoptic pseudo-labels from monocular scene-centric videos. Instance pseudo-labels come from motion-based region growing on self-supervised optical flow (SMURF) and depth (DynamoDepth). Semantic pseudo-labels come from distilled DINO features (k-means) refined by depth-guided inference. Temporally, instances are propagated and tracked via warped-IoU + Hungarian matching, semantics are smoothed by a 3-frame majority vote, and semantic↔instance assignments are aligned per clip. A pixel-ratio threshold splits things from stuff. The resulting pseudo-labels supervise a Panoptic Cascade MaskTrack R-CNN with DINO ResNet-50, trained with our Video DropLoss and self-enhanced video copy-paste augmentation.
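Two of the temporal steps named above, IoU-based Hungarian matching and the 3-frame majority vote, can be illustrated with a minimal sketch. This is not the VideoCUPS implementation: the function names, the IoU threshold of 0.5, and the assumption that previous-frame masks and neighbouring label maps have already been warped into the current frame are all hypothetical choices made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boolean masks of equal shape."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0


def match_instances(warped_prev, curr, iou_thresh=0.5):
    """Match previous-frame masks (assumed already warped into the current
    frame) to current-frame masks. Returns (prev_idx, curr_idx) pairs.
    iou_thresh is a hypothetical value, not taken from the paper."""
    if not warped_prev or not curr:
        return []
    iou = np.array([[mask_iou(p, c) for c in curr] for p in warped_prev])
    # Hungarian matching: one-to-one assignment that maximises total IoU.
    rows, cols = linear_sum_assignment(iou, maximize=True)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if iou[r, c] >= iou_thresh]


def majority_vote(prev, curr, nxt):
    """Per-pixel majority over three aligned semantic label maps.
    If the two neighbours agree, they outvote the centre frame;
    otherwise the centre label is kept (it is either the majority or a tie)."""
    out = curr.copy()
    agree = prev == nxt
    out[agree] = prev[agree]
    return out
```

Pairs whose IoU falls below the threshold are dropped, so an unmatched current-frame mask would start a new track in a full tracker. Keeping the centre label on a three-way tie is one possible tie-breaking rule among several.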


Quantitative results

We compare VideoCUPS against four competitive baselines that pair state-of-the-art unsupervised semantic, video-instance, or panoptic image segmentation methods with unsupervised tracking, evaluated with STQ, AQ, and SQ (all in %, higher is better) on Cityscapes-VPS, KITTI-STEP, Waymo, and the out-of-domain MOTS benchmark. Despite training only on monocular videos, VideoCUPS outperforms every baseline on every dataset and metric, including CUPS + SORT, which uses stereo video at training time.

CUPS trained with monocular videos. Supervised reference in gray.

Method                      Cityscapes          KITTI-STEP            Waymo             MOTS (OOD)
                           STQ    AQ    SQ     STQ    AQ    SQ     STQ    AQ    SQ     STQ    AQ    SQ
Supervised                42.0  27.0  65.3    53.9  59.9  48.4    22.3  12.6  39.4    20.5  12.7  33.1
DepthG + VideoCutLER       9.9   3.4  28.2    13.2   8.7  20.1     7.9   2.6  23.9    14.5   6.8  30.7
U2Seg + SORT              11.4   5.6  23.0    24.0  21.1  27.2    10.4   4.8  22.6    14.9   7.2  30.8
CUPS + SORT               20.6  13.3  31.8    34.2  37.7  31.1    17.5   9.9  30.8    16.7  10.4  27.0
CUPS + SORT               17.8  10.6  29.9    32.9  35.4  30.5    16.6   9.3  29.8    14.9   7.8  28.3
VideoCUPS (ours)          22.2  15.3  32.3    37.3  43.6  32.0    18.4  10.7  31.6    18.6  10.5  33.0
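
As a reading aid for the three metrics: STQ is defined as the geometric mean of AQ and SQ, so the STQ column can be approximately recomputed from the other two. The short sketch below does this for the VideoCUPS row of the table above. The helper name `stq` is hypothetical, and because the table entries are rounded to one decimal, the recomputed values are expected to match the reported ones only to within about 0.1.

```python
import math


def stq(aq: float, sq: float) -> float:
    """Geometric mean of association quality and segmentation quality."""
    return math.sqrt(aq * sq)


# (reported STQ, AQ, SQ) for the VideoCUPS row of the table, all in %.
videocups = {
    "Cityscapes": (22.2, 15.3, 32.3),
    "KITTI-STEP": (37.3, 43.6, 32.0),
    "Waymo": (18.4, 10.7, 31.6),
    "MOTS (OOD)": (18.6, 10.5, 33.0),
}

for name, (reported, aq, sq) in videocups.items():
    print(f"{name}: reported {reported:.1f}, recomputed {stq(aq, sq):.2f}")
```

Because the geometric mean penalises imbalance, a method with high SQ but low AQ (good per-frame semantics, weak tracking) still receives a low STQ.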

Label-efficient learning

[Figure: Label-efficient learning plot]

We fine-tune a VideoCUPS-pretrained model on varying fractions of Cityscapes-VPS labels and compare against the same architecture initialized with DINO. With only 10% of the labels, VideoCUPS already reaches the STQ of a randomly initialized supervised model trained on the full Cityscapes-VPS train set, and improves over the DINO-initialized baseline by 4.6% STQ. At 100% labels, VideoCUPS still improves over DINO by +2.6% STQ, +2.3% AQ, and +3.5% SQ.


Qualitative comparisons

Below, we showcase video panoptic predictions on Cityscapes demo video sequences. Where the baselines struggle with temporal consistency, VideoCUPS produces stable, temporally coherent panoptic masks.

[Videos: VideoCUPS (ours) vs. DepthG + VideoCutLER]

BibTeX

@inproceedings{Reich:2026:VideoCUPS,
  title     = {Scene-Centric Unsupervised Video Panoptic Segmentation},
  author    = {Reich, Christoph and Hahn, Oliver and Araslanov, Nikita and
               Leal-Taix{\'e}, Laura and Rupprecht, Christian and
               Cremers, Daniel and Roth, Stefan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgments: This project was partially supported by the European Research Council (ERC) Advanced Grant SIMULACRON (grant agreement No. 884679), DFG project CR 250/26-1 “4D-YouTube”, and GNI Project “AICC”. This project was also partially supported by the ERC under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 866008). Additionally, this work has been co-funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center [LOEWE/1/12/519/03/05.001(0016)/72] and by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany’s Excellence Strategy (EXC 3066/1 “The Adaptive Mind”, Project No. 533717223). Christoph Reich is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the German Federal Ministry of Education and Research. Christian Rupprecht is supported by an Amazon Research Award. Finally, we acknowledge the support of the European Laboratory for Learning and Intelligent Systems (ELLIS) and thank Simone Schaub-Meyer for insightful discussions. Website template from DreamFusion and MVDream.
