Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities

¹I3A – University of Zaragoza, ²Technical University of Darmstadt, ³hessian.AI

*Equal contribution

KARMMA motivation

TL;DR 🚀: KARMMA is a multimodal-to-multimodal distillation framework for egocentric action recognition that does not require modality-aligned data and supports any subset of modalities at inference. It produces a fast and lightweight student that remains robust under missing modalities without retraining.


Abstract

Egocentric action recognition enables robots to facilitate human-robot interactions and monitor task progress. Existing methods often rely solely on RGB videos, although additional modalities, such as audio, can improve accuracy under challenging conditions. However, most multimodal approaches assume that all modalities are available at inference time, leading to significant accuracy drops, or even failure, when inputs are missing. To address this limitation, we introduce KARMMA, a multimodal Knowledge distillation framework for egocentric Action Recognition robust to Missing ModAlities that does not require modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that leverages all available modalities while remaining robust to missing ones, enabling deployment across diverse sensor configurations without retraining. Our student uses approximately 50% fewer computational resources than the teacher, resulting in a lightweight and fast model that is well suited for on-robot deployment. Experiments on Epic-Kitchens and Something-Something demonstrate that our student achieves competitive accuracy while significantly reducing performance degradation under missing modality conditions.

Method

KARMMA training consists of two stages. In the first stage, the teacher processes the available modalities using frozen, pre-trained unimodal feature extractors and learns to fuse their features through a transformer-based fusion block. The fused features are then passed to a multi-head MLP and optimized with a cross-entropy loss. In the second stage, the student, built with smaller feature extractors and a compact fusion block, learns from the frozen teacher using a combination of cross-entropy and knowledge distillation losses.
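The second-stage objective can be sketched as follows. This is a minimal numpy sketch, assuming a temperature-scaled KL term and an equal weighting `alpha` between the two losses; the temperature `T` and `alpha` are illustrative values, not the hyperparameters used in the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax with a max-shift for numerical stability."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Cross-entropy on ground truth + KL divergence to the frozen teacher."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # KL(teacher || student), batch-averaged and scaled by T^2
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1).mean() * T * T
    # Standard cross-entropy on the student's (unscaled) predictions
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * ce + (1 - alpha) * kl
```

When student and teacher agree exactly, the KL term vanishes and only the cross-entropy remains, so the distillation term only penalizes disagreement with the teacher.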

Teacher diagram
Distillation diagram

Our proposed KARMMA framework incorporates three main enhancements:

  1. Modality Dropout: Entire modalities are randomly dropped with probability p, while ensuring that at least one modality remains active. We apply modality dropout to both the teacher and the student, enabling training without requiring modality-aligned samples.
  2. Missing Modality Strategy: To enhance the student's robustness to missing modalities, we introduce two types of learnable tokens: modality-specific tokens, which differentiate modalities and act similarly to positional encodings, and token-specific tokens, which compensate when a modality is absent.
  3. Knowledge Distillation: The frozen teacher transfers its knowledge to the lightweight student by aligning their class probability distributions via Kullback-Leibler (KL) divergence, improving accuracy and robustness to missing modalities.
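Modality dropout and missing-modality token substitution (enhancements 1 and 2) can be sketched together as below. The modality names, feature shapes, and the exact substitution scheme are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def modality_dropout(features, missing_tokens, p=0.3):
    """Drop whole modalities with probability p, keeping at least one active.

    features: dict mapping modality name -> feature array
    missing_tokens: dict mapping modality name -> learnable token (placeholder)
    Dropped modalities are replaced by their missing-modality token.
    """
    names = list(features)
    keep = {m: rng.random() >= p for m in names}
    if not any(keep.values()):              # guarantee one active modality
        keep[rng.choice(names)] = True
    return {
        m: features[m] if keep[m]
        else np.broadcast_to(missing_tokens[m], features[m].shape).copy()
        for m in names
    }
```

Because dropped inputs are replaced by per-modality tokens rather than zeros, the fusion block sees a consistent input layout whether or not a sensor is present, which is what allows the same student to serve any subset of modalities.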

Inference Robustness to Missing Modalities

In real-world robotic deployments, sensors may fail at run time (e.g., occluded cameras or muted microphones), leading to dynamically missing modalities. To emulate such conditions, we gradually increase the probability of dropping each modality from 0% to 90% during inference. We evaluate:

  • Baseline: same architecture as KARMMA-S but trained end-to-end with cross-entropy loss, without the KARMMA enhancements.
  • Baseline w/ δ: Baseline plus modality dropout and our missing-modality strategy.
  • KARMMA-S: our proposed student with all enhancements.
Missing modalities plot

(a) Epic-Kitchens

Missing modalities plot

(b) Something-Something

Results show that the Baseline is highly sensitive to missing modalities, suffering large accuracy degradation even at low dropout rates. Incorporating modality dropout and our missing-modality strategy substantially improves robustness: Baseline w/ δ exhibits a much smaller accuracy drop. Furthermore, our distillation pipeline enables KARMMA-S to consistently outperform both baselines across all scenarios and datasets.
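The robustness sweep above can be emulated with a small harness like the following. The independent per-modality drop, the modality names, and the placeholder `model`/`dataset` interfaces are assumptions about the protocol, not the actual evaluation code:

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_modalities(sample, p):
    """Simulate run-time sensor failure: each modality becomes
    independently unavailable with probability p."""
    return {m: x for m, x in sample.items() if rng.random() >= p}

def accuracy_under_dropout(model, dataset, p):
    """Top-1 accuracy when modalities are randomly missing at inference."""
    correct = 0
    for sample, label in dataset:
        inputs = drop_modalities(sample, p)
        if not inputs:                      # every sensor failed this sample
            continue
        correct += int(model(inputs) == label)
    return correct / len(dataset)

# Sweep the drop probability from 0% to 90%, as in the experiments above.
drop_probs = [round(0.1 * k, 1) for k in range(10)]
```

A robust model should show a flat accuracy curve across `drop_probs`, while a brittle one degrades steeply as soon as the drop probability rises above zero.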

Resource Efficiency

Our KARMMA student reduces memory usage by approximately 50% at inference compared to the teacher while significantly lowering GFLOPs, resulting in a lightweight model suitable for edge and on-robot deployment. The largest savings occur in unimodal audio (A) settings, where the student employs a compact audio encoder, substantially smaller than those used for RGB video (V) and optical flow (F).

memory
gflops

Qualitative Results

BibTeX

@inproceedings{carrion2026karmma,
    author    = {Dustin Carrión-Ojeda* and Maria Santos-Villafranca* and Alejandro Perez-Yus and Jesus Bermudez-Cameo and Jose J. Guerrero and Simone Schaub-Meyer},
    title     = {Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities},
    booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
    year      = {2026},
    note      = {to appear}
}

Acknowledgments

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (EXC-3057/1 "Reasonable Artificial Intelligence", Project No. 533677015). We gratefully acknowledge support from the hessian.AI Service Center (funded by the Federal Ministry of Research, Technology and Space, BMFTR, grant No. 16IS22091) and the hessian.AI Innovation Lab (funded by the Hessian Ministry for Digital Strategy and Innovation, grant No. S-DIW04/0013/003). SSM has been funded by the DFG (project No. 529680848). MSV, APY, JBC, and JJG further acknowledge support by projects PID2024-158322OB-I00 and PID2021-125209OB-I00 (MCIN/AEI/10.13039/501100011033/FEDER, UE), project JIUZ2024-IyA-07, a DGA 2022-2026 scholarship, and Grant SMT Erasmus+, project 2022-1-ES01-KA131-HED-000065592, funded by Campus Iberus.

This webpage template is adapted from Nerfies, under a CC BY-SA 4.0 License.