Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities

1University of Zaragoza, 2TU Darmstadt, 3hessian.AI

*Equal contribution.

KARMMA motivation

TL;DR 🚀: We propose KARMMA, a multimodal-to-multimodal knowledge distillation framework for egocentric action recognition that does not require modality-aligned data. It can handle any subset of the training modalities at inference without retraining, while remaining fast and resource-efficient.


Abstract

Existing methods for egocentric action recognition often rely solely on RGB videos, although additional modalities, e.g., audio, can improve accuracy in challenging scenarios. However, most multimodal approaches assume all modalities are available at inference, leading to significant accuracy drops, or even failure, when inputs are missing. To address this, we introduce KARMMA, a multimodal Knowledge distillation approach for egocentric Action Recognition robust to Missing ModAlities that requires no modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that benefits from all available modalities while remaining robust to missing ones, making it suitable for diverse scenarios without retraining. Our student uses approximately 50% fewer computational resources than our teacher, resulting in a lightweight and fast model. Experiments on Epic-Kitchens and Something-Something show that our student achieves competitive accuracy while significantly reducing accuracy drops under missing modality conditions.

Method

The KARMMA training consists of two stages. In the first stage, the teacher processes all available modalities using frozen unimodal feature extractors and learns to fuse their features through a transformer-based fusion block. The fused features are then passed to a multi-head MLP, and the teacher is optimized with a cross-entropy loss. In the second stage, the student learns from the frozen, pre-trained teacher using a combination of cross-entropy and knowledge distillation losses.
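As a rough illustration of the first stage, the sketch below shows a teacher that fuses pre-extracted unimodal feature tokens with a transformer encoder and classifies the fused representation with a multi-head MLP. All names, dimensions, layer counts, and head sizes are illustrative assumptions, not the exact KARMMA configuration.

import torch
import torch.nn as nn

class TeacherFusion(nn.Module):
    """Hypothetical teacher: transformer-based fusion over frozen unimodal features."""

    def __init__(self, dim=768, num_layers=2, num_verbs=97, num_nouns=300):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # "Multi-head MLP": a shared trunk followed by one head per sub-task
        # (e.g., verb and noun classification on Epic-Kitchens).
        self.trunk = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.verb_head = nn.Linear(dim, num_verbs)
        self.noun_head = nn.Linear(dim, num_nouns)

    def forward(self, modality_feats):
        # modality_feats: list of (B, T_m, dim) token sequences, one per available
        # modality, produced by frozen unimodal feature extractors.
        tokens = torch.cat(modality_feats, dim=1)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([cls, tokens], dim=1))[:, 0]  # fused CLS feature
        h = self.trunk(fused)
        return self.verb_head(h), self.noun_head(h)

In this stage only the fusion block and classification heads receive gradients; the unimodal extractors stay frozen and the heads are trained with cross-entropy on the action labels.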

Teacher diagram
Distillation diagram

Our proposed KARMMA framework incorporates three main enhancements:

  1. Modality Dropout: Entire modalities are randomly dropped with probability p while ensuring at least one modality remains active. We apply modality dropout to both the teacher and the student so neither network relies on having the full modality set during training.
  2. Missing Modality Strategy: To improve the student's robustness to missing inputs, we use two types of learnable tokens, modality-specific and token-specific. The former differentiates modalities and acts similarly to positional encodings, while the latter helps the model adapt when a modality is absent.
  3. Knowledge Distillation: The frozen teacher transfers its knowledge to the lightweight student by aligning their class probability distributions using Kullback-Leibler (KL) divergence. A minimal sketch of all three enhancements follows this list.
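The sketch below illustrates the three enhancements, assuming per-modality feature tensors of shape (batch, tokens, dim). The function names, token shapes, dropout probability, temperature, and loss weighting are placeholders chosen for illustration, not the paper's settings.

import torch
import torch.nn.functional as F

def modality_dropout(feats: dict, p: float = 0.5) -> dict:
    """Drop entire modalities with probability p, always keeping at least one."""
    kept = {m: f for m, f in feats.items() if torch.rand(()).item() >= p}
    if not kept:  # re-insert one modality so at least one stays active
        m = list(feats)[torch.randint(len(feats), ()).item()]
        kept = {m: feats[m]}
    return kept

def build_fusion_input(feats, modality_tokens, missing_tokens, all_modalities):
    """Add a learnable modality-specific token to each present modality (acting like
    a positional encoding) and substitute learnable placeholder tokens for absent ones."""
    batch = next(iter(feats.values())).size(0)  # at least one modality is present
    parts = []
    for m in all_modalities:
        if m in feats:
            parts.append(feats[m] + modality_tokens[m])              # (1, 1, dim) broadcast
        else:
            parts.append(missing_tokens[m].expand(batch, -1, -1))    # (1, T_m, dim) parameter
    return torch.cat(parts, dim=1)

def distillation_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5):
    """Cross-entropy on ground truth plus KL divergence to the frozen teacher's
    softened class distribution (standard logit distillation)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    return alpha * ce + (1.0 - alpha) * kl

During the second stage the teacher runs frozen in evaluation mode, while the student minimizes the combined cross-entropy and KL terms under the same modality dropout.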

Impact of Missing Modalities at Inference

Real-world scenarios often involve dynamically missing modalities due to sensor malfunctions. To simulate this, we gradually increase the probability of dropping each modality from 0% to 90% during inference; a minimal sketch of this protocol follows the list below. We evaluate:

  • Baseline: same architecture as KARMMA-S but trained end-to-end with cross-entropy loss, without the KARMMA enhancements.
  • Baseline w/ δ: adds modality dropout and our missing-modality strategy.
  • KARMMA-S: our proposed student with all enhancements.
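The evaluation protocol can be sketched as follows. The modality set, the names evaluate, model, and loader, and the choice to always keep at least one modality per sample are assumptions of this sketch, not the released code.

import random

MODALITIES = ("rgb", "flow", "audio")  # illustrative modality set

def simulate_sensor_failure(sample: dict, p: float) -> dict:
    """Independently remove each modality with probability p; keep at least one
    so the model always receives some input (an assumption of this sketch)."""
    kept = {m: x for m, x in sample.items() if random.random() >= p}
    return kept if kept else dict([random.choice(list(sample.items()))])

# Sweep the drop probability from 0% to 90% and measure accuracy at each step:
# for p in (i / 10 for i in range(10)):
#     acc = evaluate(model, loader, transform=lambda s: simulate_sensor_failure(s, p))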

Missing modalities plot

(a) Epic-Kitchens

Missing modalities plot

(b) Something-Something

Results show that the Baseline is highly sensitive to missing modalities. Adding modality dropout and our missing-modality strategy significantly improves robustness, as Baseline w/ δ exhibits a much smaller accuracy drop. Furthermore, our distillation pipeline enables KARMMA-S to consistently outperform both baselines across all scenarios and datasets.

Resource Efficiency

Our KARMMA student reduces memory usage by at least 50% compared to the teacher while significantly lowering GFLOPs, resulting in a lightweight model with faster inference. The largest savings occur when using only audio (A), as the student employs a much smaller feature extractor than those required for RGB video (V) and optical flow (F).

Memory usage plot
GFLOPs plot

Qualitative Results

BibTeX

@article{santos2025multimodal,
    author  = {Maria Santos-Villafranca and Dustin Carrión-Ojeda and Alejandro Perez-Yus and Jesus Bermudez-Cameo and Jose J. Guerrero and Simone Schaub-Meyer},
    title   = {Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities},
    journal = {arXiv},
    year    = {2025}
}

Acknowledgments

This work was supported by projects PID2021-125209OB-I00 and TED2021-129410B-I00 (MCIN/AEI/10.13039/501100011033, FEDER/UE, and NextGenerationEU/PRTR), the DGA 2022-2026 grant, and Grant SMT Erasmus+, project 2022-1-ES01-KA131-HED-000065592, funded by Campus Iberus. The project has also been funded by the Hessian Ministry of Science and Research, Arts and Culture (HMWK) through the project "The Third Wave of Artificial Intelligence - 3AI". The work was further supported by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany's Excellence Strategy (EXC 3057/1 "Reasonable Artificial Intelligence", Project No. 533677015).

This webpage template is adapted from Nerfies, under a CC BY-SA 4.0 License.