Beyond Accuracy: What Matters in Designing Well-Behaved Models?

A Study on Model Behavior


Teaser Image

Abstract

Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights, including that (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm to improve almost all considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
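
For intuition, the sketch below shows one way a multi-dimensional quality score in the spirit of QUBA could be aggregated in Python: rank models within each quality dimension and average the percentile ranks. The function name quba_like_score, the dimension names, and all numbers are illustrative assumptions; the exact QUBA formulation is given in the paper and may differ.

    import pandas as pd

    def quba_like_score(df: pd.DataFrame, dimensions: list[str]) -> pd.Series:
        """Aggregate an overall quality score by averaging per-dimension
        percentile ranks (higher is assumed to be better for every dimension).
        Illustrative sketch only; the paper's exact QUBA formulation may differ."""
        ranks = df[dimensions].rank(ascending=True, pct=True)  # percentile rank per dimension
        return ranks.mean(axis=1)

    # Hypothetical, made-up numbers for three backbones (not results from the paper).
    models = pd.DataFrame(
        {
            "accuracy": [0.81, 0.84, 0.79],
            "robustness": [0.55, 0.48, 0.60],
            "calibration": [0.92, 0.88, 0.90],
        },
        index=["resnet50", "vit_b16", "clip_vit_b16"],
    )

    scores = quba_like_score(models, ["accuracy", "robustness", "calibration"])
    print(scores.sort_values(ascending=False))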



Interactive Plot & Table

Welcome to our interactive plot! Here, you can explore the data used in our analysis to gain deeper insights into specific models or groups of models. Use the checkboxes below the scatterplot to select the subsets you'd like to visualize; you can filter by (pre-)training datasets, architectures, and various training paradigms. The selected models are listed in a table below the checkboxes. The table is hidden by default; click the 'Show Table' button to reveal it. You can also customize the scatterplot by selecting which quality dimensions to display on the x- and y-axes.
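
If you prefer to explore the data offline, a rough Python analogue of the interactive plot might look like the sketch below. The file name model_quality_scores.csv, the column names, and the filter values are assumptions for illustration, not the actual data release format.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Assumed CSV with one row per backbone and columns for metadata and quality scores.
    df = pd.read_csv("model_quality_scores.csv")

    # Filter a subset, e.g. self-supervised models pretrained on ImageNet-21k
    # (column and value names are assumptions).
    subset = df[(df["training_paradigm"] == "self-supervised")
                & (df["pretraining_dataset"] == "ImageNet-21k")]

    # Pick which quality dimensions go on the x- and y-axes, mirroring the plot controls.
    x_dim, y_dim = "accuracy", "adversarial_robustness"
    plt.scatter(subset[x_dim], subset[y_dim])
    plt.xlabel(x_dim)
    plt.ylabel(y_dim)
    plt.title("Selected model subset")
    plt.show()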

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 866008). The project has also been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project number 529680848. Further, the project has been supported by the State of Hesse through the cluster projects “The Third Wave of Artificial Intelligence (3AI)” and “The Adaptive Mind (TAM).”

Citation

If you find this project useful, please consider citing:

    @article{Hesse:2025:beyond_accuracy,
        title={Beyond Accuracy: What Matters in Designing Well-Behaved Models?},
        author={Robin Hesse and Do{\u{g}}ukan Ba{\u{g}}c{\i} and Bernt Schiele and Simone Schaub-Meyer and Stefan Roth},
        year={2025},
        journal={arXiv:2503.17110 [cs.CV]},
    }