Investigating Effects of Saturation in Integrated Gradients

Human Interpretability Workshop at ICML


Integrated Gradients is a popular method for post-hoc model interpretability. Despite its popularity, the composition and relative impact of different regions of the path integral are not well understood. We explore these effects and find that gradients in saturated regions of the scaling factor, where the model output changes minimally, contribute disproportionately to the computed attribution. We propose a variant of Integrated Gradients that primarily captures gradients in unsaturated regions and evaluate this method on ImageNet classification networks. We find that this attribution technique shows higher model faithfulness and lower sensitivity to noise than standard Integrated Gradients.
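To make the saturation effect concrete, here is a minimal sketch of standard Integrated Gradients on a toy model. This is an illustration only, not the paper's implementation: the model, weights, and step count are hypothetical, and the gradient is computed analytically rather than by a deep-learning framework. The sigmoid output saturates for large inputs, so gradients taken at scaling factors near 1 can be tiny even though those path points still receive weight in the attribution sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w):
    # Toy model: sigmoid of a linear score. The output saturates
    # once the score w @ x is large, mimicking a confident classifier.
    return sigmoid(w @ x)

def model_grad(x, w):
    # Analytic gradient of the toy model w.r.t. the input x.
    s = sigmoid(w @ x)
    return s * (1.0 - s) * w

def integrated_gradients(x, baseline, w, steps=100):
    # Midpoint Riemann-sum approximation of the path integral
    # from the baseline to the input x.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack(
        [model_grad(baseline + a * (x - baseline), w) for a in alphas]
    )
    return (x - baseline) * grads.mean(axis=0)

w = np.array([4.0, -2.0, 1.0])          # hypothetical fixed weights
x = np.array([2.0, 1.0, 0.5])           # input to explain
baseline = np.zeros_like(x)             # all-zeros baseline

attr = integrated_gradients(x, baseline, w)
# Completeness axiom: attributions sum to f(x) - f(baseline).
gap = abs(attr.sum() - (model(x, w) - model(baseline, w)))
print(gap < 1e-3)  # → True
```

Inspecting `model_grad` along the path shows the issue the abstract describes: for scaling factors where `w @ x` is already large, `s * (1 - s)` is near zero, so most of the attribution mass accumulates in the unsaturated early portion of the path, and gradients sampled in the saturated region add noise rather than signal.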

