Addressing Randomness in Evaluation Protocols for Out-of-Distribution Detection

Paper: Here

Our paper "Addressing Randomness in Evaluation Protocols for Out-of-Distribution Detection" has been accepted at the IJCAI 2021 Workshop on Artificial Intelligence for Anomalies and Novelties (AI4AN).

In summary, we investigated the following phenomenon: when a neural network is trained several times and its performance on some task is measured after each run, the measurements vary, because the outcome of an experiment depends on several factors that are effectively controlled by the random seed. We studied how the performance measures of several evaluation protocols used in Anomaly Detection, Out-of-Distribution Detection, Open Set Recognition (OSR), and related fields change when the random seed is varied.
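As a rough illustration (not code from the paper), the seed typically controls things like weight initialization, data shuffling, and dropout masks. A minimal sketch of seeding the usual random number generators in a PyTorch setup:

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed the generators that affect a typical training run:
    Python's `random` (e.g. augmentation), NumPy (e.g. shuffling),
    and PyTorch (weight initialization, dropout masks)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


# Runs with the same seed are (largely) reproducible; changing the seed
# changes weight init, batch order, etc., and hence the measured performance.
set_seed(0)
```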

In some of these fields, like OSR, it is common to measure the average performance over 3-5 experiments. Is this sufficient to draw reliable conclusions regarding a possible performance difference between methods?

We found that the variance is so large that it may, in fact, not be. Consequently, experiments based on too few random seeds might provide a brittle foundation for conclusions. We then argue that such an experiment should rather be seen as a fundamentally random process. Therefore, we should measure the expected value of the performance $\mathbb{E}_{x \sim p} [ f(x) ]$, where $p$ is the distribution of the random seeds and $f$ maps a seed $x$ to the outcome of an experimental setting.
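In this view, a single training run with seed $x$ yields one sample of $f(x)$, and the expectation is estimated by averaging over many seeds. A minimal sketch of such a Monte Carlo estimate, where `run_experiment` is a hypothetical placeholder for training a detector and scoring it:

```python
import numpy as np


def run_experiment(seed: int) -> float:
    """Hypothetical placeholder: train a detector with the given seed and
    return its performance (e.g. AUROC) under the evaluation protocol."""
    rng = np.random.default_rng(seed)
    return 0.85 + 0.05 * rng.standard_normal()  # stand-in for a real training run


seeds = range(100)
scores = np.array([run_experiment(s) for s in seeds])

# Monte Carlo estimate of E_{x ~ p}[f(x)] and its spread across seeds.
print(f"mean performance: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```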

Given a set of measurements, we can use statistical tests to determine if an observed difference can be considered significant. However, we found that in some cases even 1000 experiments were insufficient to infer significant differences in the results.
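For example (illustrative only, not necessarily the exact tests used in the paper), given per-seed scores for two methods, a Welch's t-test or a Mann-Whitney U test from `scipy.stats` can be used to check whether the observed difference is significant:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed AUROC scores for two methods (one value per seed).
scores_a = np.array([0.86, 0.83, 0.88, 0.84, 0.87, 0.85, 0.82, 0.89])
scores_b = np.array([0.87, 0.85, 0.86, 0.88, 0.84, 0.86, 0.83, 0.90])

# Welch's t-test (does not assume equal variances) ...
t_stat, p_t = stats.ttest_ind(scores_a, scores_b, equal_var=False)
# ... and a non-parametric alternative.
u_stat, p_u = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")

print(f"Welch's t-test: p = {p_t:.3f}")
print(f"Mann-Whitney U: p = {p_u:.3f}")
# With few seeds and large variance across runs, such p-values often stay
# well above common significance thresholds (e.g. 0.05).
```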


Last Updated: 13 Jul. 2021
Categories: Anomaly Detection · Reproducibility
Tags: AI4AN · Anomaly Detection · Reproducibility