Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
Abstract
MMDG-Bench presents a unified benchmark for multimodal domain generalization that standardizes evaluation across diverse tasks and modalities while revealing limited performance gains and significant robustness challenges.
Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization prevents a reliable assessment of the field's advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
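To make the comparison setup concrete, the sketch below shows a multimodal ERM baseline trained under a leave-one-domain-out protocol, the standard cross-domain evaluation for this kind of benchmark. All names here (ERMFusion, build_loader, the modality choices, and hyperparameters) are illustrative assumptions for a minimal sketch, not MMDG-Bench's actual code.

# Minimal sketch of a multimodal ERM baseline under leave-one-domain-out
# evaluation. Assumes precomputed per-modality features; build_loader is a
# user-supplied callable yielding (video, audio, labels) batches for the
# given list of domains. Names and hyperparameters are illustrative only.
import torch
import torch.nn as nn

class ERMFusion(nn.Module):
    def __init__(self, video_dim, audio_dim, num_classes):
        super().__init__()
        # Late fusion: concatenate per-modality features, then classify.
        self.classifier = nn.Linear(video_dim + audio_dim, num_classes)

    def forward(self, video_feat, audio_feat):
        return self.classifier(torch.cat([video_feat, audio_feat], dim=-1))

def train_leave_one_domain_out(domains, build_loader, video_dim, audio_dim, num_classes):
    results = {}
    for held_out in domains:
        # Train on all source domains, never touching the held-out target.
        sources = [d for d in domains if d != held_out]
        model = ERMFusion(video_dim, audio_dim, num_classes)
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        loss_fn = nn.CrossEntropyLoss()
        model.train()
        for video, audio, labels in build_loader(sources):
            opt.zero_grad()
            loss = loss_fn(model(video, audio), labels)
            loss.backward()
            opt.step()
        # Evaluate cross-domain accuracy on the unseen target domain.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for video, audio, labels in build_loader([held_out]):
                preds = model(video, audio).argmax(dim=-1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        results[held_out] = correct / total
    return results

Averaging the per-target accuracies in results over all held-out domains gives the kind of cross-domain score against which the specialized MMDG methods are compared.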
Community
MMDG-Bench is the first comprehensive and standardized benchmark for Multimodal Domain Generalization (MMDG).
Unlike prior work that focuses on limited datasets or settings, MMDG-Bench unifies evaluation across multiple tasks, modalities, and real-world challenges, including corruption robustness, missing modalities, and model trustworthiness.
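One of those real-world settings, missing-modality generalization, is straightforward to operationalize: a modality is dropped at test time and the accuracy degradation is measured. The sketch below zero-fills the dropped modality; that fill strategy is an assumption on my part, as benchmarks may instead use learned placeholders or modality-dropout-aware models.

# Hedged sketch of missing-modality evaluation: one modality is replaced
# with zeros at test time and accuracy is recomputed. The zero-fill choice
# is an assumption, not necessarily MMDG-Bench's exact protocol.
import torch

def evaluate_missing_modality(model, loader, drop="audio"):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for video, audio, labels in loader:
            if drop == "audio":
                audio = torch.zeros_like(audio)
            elif drop == "video":
                video = torch.zeros_like(video)
            preds = model(video, audio).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total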
The takeaway from MMDG-Bench is that, under fair comparisons, recent multimodal domain generalization methods offer only marginal gains over a strong ERM baseline. That hits hard because it suggests many claimed wins may ride on evaluation artifacts rather than real algorithmic progress. The trimodal-vs-bimodal finding and the large degradation under corruption and missing-modality scenarios emphasize that robustness, uncertainty estimation, and realistic data shifts should be the core of future work, not just clean cross-domain accuracy. The ArXivLens breakdown helped me parse the method details, and a quick read of that walkthrough is worth it: https://arxivlens.com/PaperView/Details/are-we-making-progress-in-multimodal-domain-generalization-a-comprehensive-benchmark-study-8506-d30b7a3f
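On the uncertainty side, misclassification and out-of-distribution detection are often scored with a simple confidence measure such as maximum softmax probability (MSP). The snippet below is a generic MSP scorer for illustration; whether MMDG-Bench uses MSP or a different scoring rule is an assumption here.

# Generic maximum-softmax-probability (MSP) confidence score, a common
# baseline for misclassification / OOD detection. Illustrative only; the
# benchmark's exact scoring rule may differ.
import torch
import torch.nn.functional as F

def msp_scores(model, video, audio):
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(video, audio), dim=-1)
    # Higher score = more confident; low scores flag likely errors or OOD inputs.
    return probs.max(dim=-1).values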
Get this paper in your agent:
hf papers read 2605.06643
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash