If an AI gives you a weird explanation for its prediction, you should remain skeptical about the accuracy of that prediction. Sounds reasonable?
This was the general idea of my master's thesis, originally titled Self-Assessment of Visual Recognition Systems based on Attribution. Today, I would call it Explanation-based Anomaly Detection in Deep Neural Networks. The general idea was to use attribution-based explanation methods to detect anomalies (such as unusual inputs) in convolutional neural networks. This essentially boils down to detecting unusual gradients in the network, which, to my knowledge, was a novel idea at the time. We ran some experiments and found that it somewhat worked, in some cases.
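To make the "unusual gradients" idea more concrete, here is a minimal PyTorch sketch, not the thesis code: compute a gradient-based attribution (a plain saliency map) for the predicted class and flag inputs whose attribution statistics deviate from statistics collected on in-distribution data. The summary features and the z-score threshold are illustrative assumptions.

```python
import torch

def saliency(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Gradient of the top-class score w.r.t. the input (vanilla saliency)."""
    x = x.clone().requires_grad_(True)
    scores = model(x)                      # (batch, num_classes)
    top = scores.max(dim=1).values.sum()   # top-class score, summed over the batch
    grad, = torch.autograd.grad(top, x)    # per-sample input gradients
    return grad.abs()                      # attribution map, same shape as x

def attribution_features(attr: torch.Tensor) -> torch.Tensor:
    """Summarize an attribution map into a small per-sample feature vector."""
    flat = attr.flatten(start_dim=1)
    return torch.stack(
        [flat.mean(dim=1), flat.std(dim=1), flat.max(dim=1).values], dim=1
    )

def is_anomalous(features: torch.Tensor, ref_mean: torch.Tensor,
                 ref_std: torch.Tensor, z_thresh: float = 3.0) -> torch.Tensor:
    """Simple z-score test against statistics gathered on normal data."""
    z = (features - ref_mean) / (ref_std + 1e-8)
    return (z.abs() > z_thresh).any(dim=1)
```

In practice, one would replace the hand-picked summary statistics and the fixed threshold with a learned anomaly detector; the sketch only illustrates the overall pipeline of attribution, featurization, and outlier scoring.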
Abstract
Convolutional Neural Networks (CNNs) achieve state-of-the-art results in various visual recognition tasks like object classification and object detection. While CNNs perform surprisingly well, it is difficult to retrace why they arrive at a certain prediction. Additionally, they have been shown to be prone to certain errors. As CNNs are increasingly deployed in physical systems - for example in self-driving vehicles - undetected errors could result in catastrophic consequences. Approaches to prevent this include the use of attribution-based explanation methods to facilitate an understanding of the system's decisions in hindsight, as well as the detection of recognition errors at runtime, called self-assessment. Some state-of-the-art self-assessment approaches aim to detect anomalies in the activation patterns of neurons in a CNN.
This work explores the use of attribution-based explanations for self-assessment of CNNs. We build multiple self-assessment models and evaluate their performance in various settings. In our experiments, we find that, while self-assessment based on attribution does not outperform self-assessment based on neural activity on its own, it always surpasses random guessing. Furthermore, we find that self-assessment models using neural activation patterns as well as neural attribution can in some cases outperform models which do not consider attribution patterns. Thus, we conclude that it might be possible to improve self-assessment models by including the explanation of the model in the assessment process.
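As a rough illustration of what such a combined self-assessment model could look like (this is a hedged sketch, not the exact setup from the thesis): concatenate per-sample activation features and attribution features and fit a simple classifier that predicts whether the CNN's prediction was correct. The feature extraction and the choice of logistic regression are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_self_assessment(activation_feats: np.ndarray,   # (n_samples, d_act)
                        attribution_feats: np.ndarray,   # (n_samples, d_attr)
                        prediction_correct: np.ndarray   # (n_samples,) in {0, 1}
                        ) -> LogisticRegression:
    """Train a self-assessment model on activation + attribution features."""
    features = np.concatenate([activation_feats, attribution_feats], axis=1)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, prediction_correct)
    return clf

# At runtime, a low predicted probability of "correct" flags the CNN's
# prediction as potentially unreliable.
```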