GLASS: Guided Latent Slot Diffusion for Object-Centric Learning

Abstract

Object-centric learning aims to decompose an input image into a set of meaningful object files (slots). These latent object representations enable a variety of downstream tasks. Yet, object-centric learning struggles on real-world datasets, which contain multiple objects of complex textures and shapes in natural everyday scenes. To address this, we introduce Guided Latent Slot Diffusion (GLASS), a novel slot-attention model that learns in the space of generated images and uses semantic and instance guidance modules to learn better slot embeddings for various downstream tasks. Our experiments show that GLASS surpasses state-of-the-art slot-attention methods by a wide margin on tasks such as (zero-shot) object discovery and conditional image generation for real-world scenes. Moreover, GLASS enables the first application of slot attention to the compositional generation of complex, realistic scenes.

GLASS at a glance

GLASS is an object-centric representation model that uses a diffusion decoder. GLASS learns in the space of generated images, allowing it to leverage the semantic guidance from a diffusion decoder and instance guidance from a DINOv2 encoder.

Results

GLASS is an object-centric representation learning method that supports multiple downstream tasks, such as object discovery, compositional generation, conditional generation, and property prediction. We show results for each of these tasks.

Object Discovery

GLASS outperforms existing object-centric learning (OCL) methods on the task of object discovery.

GLASS obtains cleaner boundaries and better object-level segmentation compared to existing OCL methods.
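As a rough illustration of how object discovery output is commonly read off a slot-based model, the sketch below assigns each pixel to the slot with the highest attention weight and scores one predicted segment against a ground-truth segment with intersection-over-union (IoU). This is a minimal sketch under stated assumptions, not the GLASS evaluation code: the function names `slots_to_segmentation` and `iou`, the flattened-pixel layout, and the toy numbers are all hypothetical, and this page does not state which metric is used.

```python
from typing import List


def slots_to_segmentation(attn: List[List[float]]) -> List[int]:
    """Assign each pixel to the slot with the largest attention weight.

    attn[k][p] is the (hypothetical) weight of slot k at flattened pixel p.
    """
    num_slots = len(attn)
    num_pixels = len(attn[0])
    return [
        max(range(num_slots), key=lambda k: attn[k][p])
        for p in range(num_pixels)
    ]


def iou(pred: List[int], target: List[int], pred_id: int, target_id: int) -> float:
    """Intersection-over-union between one predicted and one target segment."""
    inter = sum(1 for a, b in zip(pred, target) if a == pred_id and b == target_id)
    union = sum(1 for a, b in zip(pred, target) if a == pred_id or b == target_id)
    return inter / union if union else 0.0


# Toy example: 2 slots over 4 pixels (illustrative values only).
attn = [
    [0.9, 0.8, 0.2, 0.1],  # slot 0
    [0.1, 0.2, 0.8, 0.9],  # slot 1
]
segmentation = slots_to_segmentation(attn)
target = [0, 0, 0, 1]  # hypothetical ground-truth labels per pixel
score_slot0 = iou(segmentation, target, pred_id=0, target_id=0)
score_slot1 = iou(segmentation, target, pred_id=1, target_id=1)
```

In this toy case the hard assignment puts the first two pixels in slot 0 and the last two in slot 1, so a boundary error of one pixel lowers the IoU of both segments.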

Compositional Generation

GLASS is the first object-centric model to enable compositional generation (addition and removal of objects) for realistic scenes.

Object Removal: GLASS is able to remove the highlighted object (red) from the scene while preserving the rest of the scene.

Object Addition: GLASS is able to add the highlighted object (red) to a new scene while preserving the rest of the scene.
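Conceptually, object removal and addition can be viewed as editing the set of slots before it is passed to the decoder. The sketch below shows only that set-editing step, assuming each slot is a fixed-length vector. The helper names `remove_slot` and `add_slot` and the toy values are hypothetical, and the diffusion decoder that would render the edited slot set into an image is deliberately left out.

```python
from typing import List

Slot = List[float]  # hypothetical: one slot embedding as a plain list of floats


def remove_slot(slots: List[Slot], index: int) -> List[Slot]:
    """Return a copy of the slot set without the slot at `index`."""
    if not 0 <= index < len(slots):
        raise IndexError(f"slot index {index} out of range")
    return [s for i, s in enumerate(slots) if i != index]


def add_slot(slots: List[Slot], new_slot: Slot) -> List[Slot]:
    """Return a copy of the slot set with `new_slot` appended."""
    if slots and len(new_slot) != len(slots[0]):
        raise ValueError("slot dimensionality mismatch")
    return slots + [new_slot]


# Toy slot sets for two scenes (illustrative values only).
scene_a = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
scene_b = [[0.7, 0.8], [0.9, 1.0]]

# Object removal: drop the slot of the highlighted object from scene A.
scene_a_removed = remove_slot(scene_a, 1)

# Object addition: insert that same object slot into scene B.
scene_b_added = add_slot(scene_b, scene_a[1])
```

Both helpers return new lists and leave the original slot sets untouched, so the unedited scene can still be decoded for comparison.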

Conditional Generation

GLASS outperforms StableLSD on the task of conditional generation, producing much higher-quality images.

Object-level Property Prediction

GLASS outperforms StableLSD on the task of object-level property prediction. Note: StableLSD is the closest model in terms of downstream task capabilities.

BibTeX

@inproceedings{singh2025glass,
  author    = {Krishnakant Singh and Simone Schaub-Meyer and Stefan Roth},
  title     = {GLASS: Guided Latent Slot Diffusion for Object-Centric Learning},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
}