TL;DR: SceneDINO is unsupervised and infers 3D geometry and features from a single image in a feed-forward manner. Distilling and clustering these features yields unsupervised semantic scene completion predictions. SceneDINO is trained using multi-view self-supervision.
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
SceneDINO is trained using multi-view self-supervision and learns a feature field composed of high-dimensional DINO features and 3D geometry. Distilling SceneDINO's feature field yields unsupervised semantic predictions. During inference, SceneDINO performs SSC given a single RGB input image.
(a) Inference: Given a single input image \(\mathbf{I}_{0}\) during inference, a 2D encoder-decoder \(\xi\) produces the embedding \(\mathbf{E}\), from which the local embedding \(\mathbf{e}_{\mathbf{u}}\) is interpolated. The MLP encoder \(\phi\) takes \(\mathbf{e}_{\mathbf{u}}\) and the 3D position \(\mathbf{x}_{i}\) as input and predicts both the density \(\sigma_{\mathbf{x}_{i}}\) and the 3D feature \(f_{\mathbf{x}_{i}}\). A lightweight unsupervised segmentation head \(h\) maps \(f_{\mathbf{x}_{i}}\) to the semantic prediction \(p_{\mathbf{x}_{i}}\).
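To make the feed-forward pass concrete, below is a minimal PyTorch sketch of inference; module names, feature dimensions, and the use of bilinear grid sampling are illustrative assumptions rather than the exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneDINOSketch(nn.Module):
    """Minimal sketch of the feed-forward pass (hypothetical names and sizes)."""
    def __init__(self, encoder_decoder: nn.Module, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # 2D encoder-decoder xi, assumed to output embed_dim channels (embedding E).
        self.encoder_decoder = encoder_decoder
        # MLP encoder phi: local embedding e_u + 3D position x -> (density sigma_x, feature f_x).
        self.phi = nn.Sequential(
            nn.Linear(embed_dim + 3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1 + embed_dim),
        )

    def forward(self, image, x, u):
        # image: (1, 3, H, W); x: (N, 3) 3D query points; u: (N, 2) projections in [-1, 1].
        E = self.encoder_decoder(image)                    # (1, C, H, W) embedding E
        # Bilinearly interpolate the local embedding e_u at the projected positions u.
        e_u = F.grid_sample(E, u.view(1, -1, 1, 2), align_corners=True)
        e_u = e_u.view(E.shape[1], -1).t()                 # (N, C)
        out = self.phi(torch.cat([e_u, x], dim=-1))
        sigma = F.softplus(out[:, :1])                     # density sigma_x
        f = out[:, 1:]                                     # 3D feature f_x
        return sigma, f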
(b) Rendering: Our feature field supports volume rendering: shooting a ray through the field yields the depth \(\hat{d}\) and the feature \(\hat{f}\) in 2D. The color \(c_{i}\) is sampled from another view (e.g., \(\mathbf{I}_{1}\)) using \(\mathbf{u}_{s}\) and rendered to obtain the reconstructed color \(\hat{c}\).
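The rendering step follows standard volume rendering along a ray. The sketch below assumes discrete samples whose densities, features, and colors have already been queried (the colors \(c_{i}\) being sampled from another view); the exact compositing details may differ from the implementation.

import torch

def render_ray(sigma, feats, colors, z_vals):
    # sigma: (S,) densities; feats: (S, C) per-sample features; colors: (S, 3) colors
    # sampled from another view; z_vals: (S,) sample depths along the ray.
    deltas = torch.cat([z_vals[1:] - z_vals[:-1], torch.full_like(z_vals[:1], 1e10)])
    alpha = 1.0 - torch.exp(-sigma * deltas)                               # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                                # compositing weights
    depth = (weights * z_vals).sum()                                       # expected depth d_hat
    feat = (weights[:, None] * feats).sum(dim=0)                           # rendered feature f_hat
    color = (weights[:, None] * colors).sum(dim=0)                         # reconstructed color c_hat
    return depth, feat, color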
(c) Multi-view training: We render 2D views (features & images) from our feature field and reconstruct the training views.
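As a rough illustration of these objectives, a simplified photometric reconstruction term plus a feature term could look as follows; the actual loss composition and weighting in SceneDINO may differ.

import torch.nn.functional as F

def multiview_losses(pred_color, gt_color, pred_feat, dino_feat):
    # pred_color / pred_feat: quantities rendered from the feature field;
    # gt_color: colors of the training view; dino_feat: 2D DINO features of that view.
    photometric = (pred_color - gt_color).abs().mean()                       # image reconstruction
    feature = 1.0 - F.cosine_similarity(pred_feat, dino_feat, dim=-1).mean() # feature reconstruction
    return photometric + feature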
We provide visualizations of SceneDINO's feature field and semantic scene completion for three scenes from KITTI-360, alongside the corresponding 3D ground truth. We visualize the top 3 PCA components of the feature field. Click on any input image below to view SceneDINO's predictions. We only visualize surface voxels within the field of view for the sake of clarity.
We compare the original dense DINO features with our SceneDINO features rendered back into the image plane. The visualized PCA components are mapped to RGB channels. Interactively adjust which components to display.
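A PCA-to-RGB visualization of this kind can be sketched as follows (a generic recipe, not the exact script used for the figures): project the features onto their top three principal components and normalize each component to [0, 1].

import torch

def pca_rgb(features, k=3):
    # features: (H, W, C) feature map -> (H, W, 3) RGB visualization in [0, 1].
    H, W, C = features.shape
    flat = features.reshape(-1, C)
    flat = flat - flat.mean(dim=0, keepdim=True)
    _, _, V = torch.pca_lowrank(flat, q=k)          # top-k principal directions
    proj = flat @ V[:, :k]                          # project onto the components
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    return proj.reshape(H, W, k)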
@inproceedings{Jevtic:2025:SceneDINO,
  author    = {Aleksandar Jevti{\'c} and
               Christoph Reich and
               Felix Wimbauer and
               Oliver Hahn and
               Christian Rupprecht and
               Stefan Roth and
               Daniel Cremers},
  title     = {Feed-Forward {SceneDINO} for Unsupervised Semantic Scene Completion},
  booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025},
}
This project was partially supported by the European Research Council (ERC) Advanced Grant SIMULACRON, DFG project CR 250/26-1 "4D-YouTube", and GNI Project "AICC". This project has also received funding from the ERC under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 866008). Additionally, this work has further been co-funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center [LOEWE/1/12/519/03/05.001(0016)/72] and by the Excellence Cluster EXC3066 "The Adaptive Mind". Christoph Reich is supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) through the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. Christian Rupprecht is supported by an Amazon Research Award. Finally, we acknowledge the support of the European Laboratory for Learning and Intelligent Systems (ELLIS) and thank Mateo de Mayo as well as Igor Cvišić for help with estimating camera poses.