*Equal contribution.
TL;DR 🚀: We propose KARMMA, a multimodal-to-multimodal knowledge distillation framework for egocentric action recognition that does not require modality-aligned data. It can handle any subset of the training modalities at inference without retraining, while remaining fast and resource-efficient.
Existing methods for egocentric action recognition often rely solely on RGB videos, although additional modalities, e.g., audio, can improve accuracy in challenging scenarios. However, most multimodal approaches assume all modalities are available at inference, leading to significant accuracy drops, or even failure, when inputs are missing. To address this, we introduce KARMMA, a multimodal Knowledge distillation approach for egocentric Action Recognition robust to Missing ModAlities that requires no modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that benefits from all available modalities while remaining robust to missing ones, making it suitable for diverse scenarios without retraining. Our student uses approximately 50% fewer computational resources than our teacher, resulting in a lightweight and fast model. Experiments on Epic-Kitchens and Something-Something show that our student achieves competitive accuracy while significantly reducing accuracy drops under missing modality conditions.
The KARMMA training consists of two stages. In the first stage, the teacher processes all available modalities using frozen unimodal feature extractors and learns to fuse their features through a transformer-based fusion block. The fused features are then passed to a multi-head MLP and optimized with a cross-entropy loss. In the second stage, the student learns from the frozen, pre-trained teacher using a combination of cross-entropy and knowledge distillation losses.
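To make the two stages concrete, below is a minimal PyTorch-style sketch under our own assumptions: the module names (Teacher, stage1_step, stage2_step), the use of a standard transformer encoder for the fusion block, a two-head classifier, and the temperature and weighting of the distillation loss are illustrative choices, not the released implementation; the student is assumed to expose the same interface with lighter feature extractors.

# Minimal sketch of the two training stages; names, dimensions, and loss
# weighting are illustrative assumptions, not the paper's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Teacher(nn.Module):
    def __init__(self, extractors: dict, dim=768, num_classes=397):
        super().__init__()
        # Frozen unimodal feature extractors (e.g., RGB, optical flow, audio backbones),
        # each assumed to map its input to a (batch, dim) feature.
        self.extractors = nn.ModuleDict(extractors)
        for p in self.extractors.parameters():
            p.requires_grad = False
        # Transformer-based fusion block over the per-modality feature tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Multi-head MLP classifier (two heads assumed here for illustration).
        self.heads = nn.ModuleList([nn.Linear(dim, num_classes) for _ in range(2)])

    def forward(self, inputs: dict):
        # Only the modalities present in `inputs` are used; missing ones are skipped.
        tokens = [self.extractors[m](x).unsqueeze(1) for m, x in inputs.items()]
        fused = self.fusion(torch.cat(tokens, dim=1)).mean(dim=1)
        return [head(fused) for head in self.heads]

def stage1_step(teacher, inputs, labels, optimizer):
    # Stage 1: train the fusion block and heads with cross-entropy only.
    logits = teacher(inputs)
    loss = sum(F.cross_entropy(l, y) for l, y in zip(logits, labels))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def stage2_step(student, teacher, inputs, labels, optimizer, T=2.0, alpha=0.5):
    # Stage 2: the student learns from the frozen teacher with CE + KD (soft targets).
    with torch.no_grad():
        t_logits = teacher(inputs)
    s_logits = student(inputs)
    ce = sum(F.cross_entropy(s, y) for s, y in zip(s_logits, labels))
    kd = sum(F.kl_div(F.log_softmax(s / T, dim=-1),
                      F.softmax(t / T, dim=-1),
                      reduction="batchmean") * T * T
             for s, t in zip(s_logits, t_logits))
    loss = (1 - alpha) * ce + alpha * kd
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()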
Our proposed KARMMA framework incorporates three main enhancements:
Real-world scenarios often involve dynamically missing modalities due to sensor malfunctions. To simulate this, we gradually increase the probability of dropping each modality from 0% to 90% during inference. Under this protocol (sketched below the results), we evaluate the Baseline, the Baseline with modality dropout and our missing-modality strategy (Baseline w/ δ), and our distilled student KARMMAS.
Figure: accuracy under increasing modality drop probability on (a) Epic-Kitchens and (b) Something-Something.
Results show that the Baseline is highly sensitive to missing modalities. Adding modality dropout and our missing-modality strategy significantly improves robustness, as Baseline w/ δ exhibits a much smaller accuracy drop. Furthermore, our distillation pipeline enables KARMMAS to consistently outperform both baselines across all scenarios and datasets.
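As an illustration of this evaluation protocol, the sketch below drops each modality independently with probability p and sweeps p from 0.0 to 0.9. The function names, the rule of keeping at least one modality, and the single-logits model interface are assumptions for readability, not the exact protocol implementation.

# Hedged sketch of the missing-modality evaluation: each modality is dropped
# independently with probability p; p is swept from 0.0 to 0.9.
import random

def drop_modalities(sample: dict, p: float) -> dict:
    # Randomly remove each modality with probability p; keep at least one
    # modality so the model always receives some input (our assumption).
    kept = {m: x for m, x in sample.items() if random.random() >= p}
    if not kept:
        m = random.choice(list(sample.keys()))
        kept[m] = sample[m]
    return kept

def evaluate_under_dropout(model, dataset, drop_probs=None):
    # Sweep the drop probability from 0% to 90%, as in the evaluation above.
    drop_probs = drop_probs or [p / 10 for p in range(10)]
    results = {}
    for p in drop_probs:
        correct = total = 0
        for sample, label in dataset:
            preds = model(drop_modalities(sample, p))  # model accepts any modality subset
            correct += int(preds.argmax(-1).item() == label)
            total += 1
        results[p] = correct / total
    return results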
Our KARMMA student reduces memory usage by at least 50% compared to the teacher while significantly lowering GFLOPs, resulting in a lightweight model with faster inference. The largest savings occur when using only audio (A), as the student employs a much smaller feature extractor than those required for RGB video (V) and optical flow (F).
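For reference, a rough way to compare the student and teacher on the resource side is to count parameters and time inference, as in the hypothetical helpers below; they do not reproduce the paper's GFLOPs or memory numbers, and `model`/`inputs` are placeholders.

# Simple efficiency probes: parameter count and mean forward latency.
import time
import torch

def count_params_millions(model):
    # Total parameters (trainable and frozen), in millions.
    return sum(p.numel() for p in model.parameters()) / 1e6

@torch.no_grad()
def mean_latency_ms(model, inputs, warmup=5, runs=20):
    # Average wall-clock forward time over several runs, after warm-up.
    model.eval()
    for _ in range(warmup):
        model(inputs)
    start = time.perf_counter()
    for _ in range(runs):
        model(inputs)
    return (time.perf_counter() - start) / runs * 1e3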
@article{santos2025multimodal,
  author  = {Maria Santos-Villafranca and Dustin Carrión-Ojeda and Alejandro Perez-Yus and Jesus Bermudez-Cameo and Jose J. Guerrero and Simone Schaub-Meyer},
  title   = {Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities},
  journal = {arXiv},
  year    = {2025}
}
This work was supported by projects PID2021-125209OB-I00 and TED2021-129410B-I00 (MCIN/AEI/10.13039/501100011033, FEDER/UE, and NextGenerationEU/PRTR), the DGA 2022-2026 grant, and Grant SMT Erasmus+, project 2022-1-ES01-KA131-HED-000065592, funded by Campus Iberus. The project has also been funded by the Hessian Ministry of Science and Research, Arts and Culture (HMWK) through the project "The Third Wave of Artificial Intelligence - 3AI". The work was further supported by the Deutsche Forschungsgemeinschaft (German Research Foundation, DFG) under Germany's Excellence Strategy (EXC 3057/1 "Reasonable Artificial Intelligence", Project No. 533677015).
This webpage template is adapted from Nerfies, under a CC BY-SA 4.0 License.