running notes on evaluating interpretability

  • Interpretability is gaining attention, but it is also drawing substantial criticism.
  • The purpose of interpretations is to help a particular audience solve a particular task. It is important that the evaluations reflect this.
  • It is unclear if modern methods are getting better at downstream tasks: better-looking explanations $\neq$ better explanations.
  • As of now, it is unclear even what form these methods should take, which has led to many different alternatives: e.g. saliency maps, concept activation vectors, hierarchical feature importances, importance curves, and textual explanations. Proper metrics can help guide the development of new forms of interpretability.
  • We want to introduce a set of downstream tasks which serve as baselines for whether interpretability methods are useful and drive the innovation of new forms of useful interpretation. They can be used to derive desiderata for new methods or to evaluate their performance. It is very difficult for the field to move forward until it knows what it is trying to move towards.
  • Note: whenever possible, model-based interpretability (i.e. using an easily understandable model) is preferable to post-hoc interpretability (i.e. interpreting a trained black-box model). Claims that interpretability will help in high-stakes settings like medicine and self-driving cars are unfounded unless the methods can be shown to work reliably.

work on evaluating interpretability

  • Doshi-Velez and Kim break the evaluation of interpretability into three types
    1. application-grounded evaluation (real humans, real tasks)
    2. human-grounded evaluation (real humans, simple tasks)
    3. functionally-grounded evaluation (no real humans, proxy tasks)
  • another work proposes 3 cognitive tasks:
    1. Simulation - Predicting the system’s recommendation given an explanation and a set of input observations.
    2. Verification - Verifying whether the system’s recommendation is consistent given an explanation and a set of input observations.
    3. Counterfactual - Determining whether the system’s recommendation changes given an explanation, a set of input observations, and a perturbation that changes one dimension of the input observations.
  • Predictive vs descriptive accuracy. We continue to use the predictive vs descriptive accuracy framework, in which the two are often in conflict. Here, the focus is on measuring descriptive accuracy.
  • Some have proposed metrics to evaluate specific methods. For example, one recent paper proposes a possible method for evaluating feature importance estimates. Similarly, RISE, which builds saliency maps from random masks, is evaluated with insertion/deletion metrics. These are useful, but they do not solve the overall problem: the correct form of an interpretation is still unknown, and it is unclear which downstream tasks these evaluations are useful for.
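
As one concrete instance of a functionally-grounded metric, a deletion-style evaluation in the spirit of the insertion/deletion idea can be sketched as follows. This is a minimal sketch, not the RISE implementation: the helper names `deletion_curve` and `curve_auc`, the toy linear `model`, and the zero baseline are all assumptions made for illustration. Features are removed from most to least important, and a faithful attribution should make the model output drop quickly (lower area under the curve).

```python
import numpy as np

def deletion_curve(model, x, importance, baseline=0.0):
    """Replace features with `baseline`, most important first,
    recording the model output after each replacement."""
    order = np.argsort(-importance)          # indices, most important first
    x_cur = x.astype(float).copy()
    scores = [model(x_cur)]                  # output on the untouched input
    for idx in order:
        x_cur[idx] = baseline
        scores.append(model(x_cur))
    return np.array(scores)

def curve_auc(scores):
    """Trapezoid-rule area under the curve, x-axis normalized to [0, 1]."""
    steps = len(scores) - 1
    return float(np.sum((scores[:-1] + scores[1:]) / 2.0) / steps)

# Hypothetical toy model: a linear score with known weights.
w = np.array([3.0, 0.0, 1.0, 0.5])
model = lambda x: float(w @ x)
x = np.ones(4)

good = w * x      # attribution that matches the model exactly
bad = -good       # deliberately reversed ranking

auc_good = curve_auc(deletion_curve(model, x, good))
auc_bad = curve_auc(deletion_curve(model, x, bad))
print(auc_good, auc_bad)   # the faithful attribution gives the lower AUC
```

Note that this only ranks attributions against each other for a fixed model and baseline; it does not say which downstream task the better-ranked attribution would help with.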

concrete tasks

  1. improving performance such as predictive accuracy or sample efficiency (finding scores that generalize)
    • for imagenet, this can take the form of refining a model
    • finding failure modes of a model and automatically flipping the affected predictions also works
    • feature engineering: extracting interactions and adding them to a linear model (e.g. on simple datasets - pmlb)
    • show that debugging helps find errors
    • has 2 parts: (1) getting the explanation and (2) deciding how to use it
    • might improve data efficiency by having a human in the loop
    • sanity checks, e.g. on preprocessing or on whether the model is looking at the correct features
  2. predicting uncertainty/robustness (can I trust this prediction?)
    • should I trust this prediction or not?
    • does the estimated uncertainty correlate with “groundtruth” uncertainty, i.e. the probability that the model would actually get something wrong?
    • can we use this to identify when a model will fail?
    • e.g. Bayesian neural nets give explicit uncertainties
    • e.g. the influence functions paper flips training labels and then finds the flipped examples
  3. discovering/fixing models (things like bias)
    • often we use interpretability to “fix” something about a model
    • want to make the fix generalize (i.e. once we find that a bias is gone inside the model, we want to make sure the fix also holds on new data)
    • right for the right reasons
  4. finding causal descriptions (in the data)
    • causal inference finds causal relationships in the data (and has a long history of using simulations)
    • finding causal relationships about the model is different, and can be much simpler
      • descriptions can help summarize something for fundamental understanding (e.g. science)
      • feature importances are often trying to get at some notion like this
    • discovering interactions
    • object deletion seems like a decent task, since it preserves structure of the scene
    • could use a GAN to fill in the missing region
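
To make the feature-engineering task from item 1 concrete, the sketch below adds an interaction term to a linear model and checks whether held-out accuracy improves. It is only an illustration under stated assumptions: the data are synthetic, the `x0 * x1` interaction is specified by hand where an interpretation method would normally have to propose it, and `fit_r2` is a hypothetical helper rather than part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Synthetic target (assumption): one main effect plus an x0*x1 interaction.
y = X[:, 2] + 2.0 * X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=n)

def fit_r2(F_train, y_train, F_test, y_test):
    """Fit ordinary least squares with an intercept; return held-out R^2."""
    A = np.column_stack([np.ones(len(F_train)), F_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    A_test = np.column_stack([np.ones(len(F_test)), F_test])
    resid = y_test - A_test @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Candidate interaction, standing in for the output of an interpretation method.
X_int = np.column_stack([X, X[:, 0] * X[:, 1]])

split = 400  # first 400 rows for training, the rest held out
r2_plain = fit_r2(X[:split], y[:split], X[split:], y[split:])
r2_inter = fit_r2(X_int[:split], y[:split], X_int[split:], y[split:])
print(f"linear only: {r2_plain:.3f}, with interaction: {r2_inter:.3f}")
```

The gain in held-out R^2 from the added term is the downstream score; an interpretation method that proposes irrelevant interactions would show little or no gain.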

descriptive accuracy

All the above tasks implicitly require that the interpretation method has high descriptive accuracy (particularly finding causal patterns in data). This type of descriptive accuracy can be measured via simulation.

  • for images, this requires a new simulation: driving simulator? 3d models projected to 2D? Images of cats w/ eye colors?
  • need to evaluate via simulations
    • ex. finding groundtruth via simulations of images, or altering the dataset (e.g. turning a texture bias into a shape bias by removing texture)
      • one problem with this is that the model could be misled and accidentally attribute importance to the wrong part of the image
      • another good baseline is removing objects from images
    • for text, this evaluation would entail evaluating faithfulness to the model, not the input (we shouldn’t be able to change the model’s prediction without changing the explanation and vice versa)


  • we are constantly striving to evaluate interp and make it better (e.g. human experiments in the ACD paper, simulation experiments in the TRIM/DAC papers, improving predictive accuracy in CDEP paper)
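
The simulation-based measurement of descriptive accuracy described in this section can be sketched on tabular data, where the ground truth is known because we write the simulator ourselves. This is a minimal sketch under several assumptions: the data-generating process, the least-squares stand-in model, and the use of permutation importance as the interpretation method are all illustrative choices, not methods taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 6
true_features = {0, 3}   # known ground truth: only these columns affect y
X = rng.normal(size=(n, d))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.1 * rng.normal(size=n)

# Stand-in model (assumption): plain least squares without intercept.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda X_: X_ @ coef

def permutation_importance(predict, X, y, rng):
    """Increase in mean squared error when a single column is shuffled."""
    base = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X[:, j])
        scores[j] = np.mean((predict(X_perm) - y) ** 2) - base
    return scores

scores = permutation_importance(predict, X, y, rng)

# Compare the top-k ranked features against the simulator's ground truth.
k = len(true_features)
top_k = set(np.argsort(-scores)[:k].tolist())
recovered = len(top_k & true_features) / k
print(f"fraction of ground-truth features recovered: {recovered:.2f}")
```

A score like `recovered` mixes two things: how well the model learned the simulated relationship and how well the interpretation describes the model. It is most informative when the model's predictive accuracy on the simulation is already high.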