some notes on uncertainty in machine learning, particularly deep learning
1.11. uncertainty
1.11.1. basics
calibration - predicted probabilities should match real (empirical) probabilities
platt scaling - given a trained classifier and a new calibration dataset, basically just fit a logistic regression from the classifier's predictions → the labels (sketch below)
isotonic regression - nonparametric, requires more data than platt scaling
fits a piecewise-constant, nondecreasing function instead of a logistic regression
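a minimal sketch of both calibration methods with sklearn, assuming a held-out calibration split; the data and base classifier here are just placeholders:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# platt scaling: logistic regression on the trained classifier's scores
# (newer sklearn versions prefer wrapping clf in FrozenEstimator over cv="prefit")
platt = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit").fit(X_calib, y_calib)

# isotonic regression: piecewise-constant, nondecreasing mapping (needs more data)
iso = CalibratedClassifierCV(clf, method="isotonic", cv="prefit").fit(X_calib, y_calib)

calibrated_probs = platt.predict_proba(X_calib)
```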
confidence - e.g. max predicted probability, max margin (top-1 minus top-2 probability), or entropy of predicted probabilities across classes (snippet below)
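the three confidence measures above in a small numpy snippet (the probabilities are made up):

```python
import numpy as np

probs = np.array([[0.80, 0.15, 0.05],   # fairly confident prediction
                  [0.40, 0.35, 0.25]])  # much less confident

max_prob = probs.max(axis=1)                    # predicted probability as confidence
top2 = np.sort(probs, axis=1)[:, -2:]
margin = top2[:, 1] - top2[:, 0]                # max margin: top-1 minus top-2
entropy = -(probs * np.log(probs)).sum(axis=1)  # higher entropy = less confident
```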
ensemble uncertainty - ensemble predictions yield uncertainty (e.g. variance within the ensemble)
quantile regression - use the quantile loss to penalize over- and under-predictions asymmetrically + get confidence intervals (sketch below)
quantile loss = \(\begin{cases} \alpha \cdot \Delta & \text{if} \quad \Delta > 0\\ (\alpha - 1) \cdot \Delta & \text{if} \quad \Delta < 0\end{cases}\)
\(\Delta = \) actual \(-\) predicted
Single-Model Uncertainties for Deep Learning (tagasovska & lopez-paz 2019) - use simultaneous quantile regression (one net estimates all quantile levels at once)
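a minimal numpy sketch of the quantile (pinball) loss above; training one model per \(\alpha\) (e.g. 0.05 and 0.95) yields a 90% interval:

```python
import numpy as np

def quantile_loss(y, y_pred, alpha):
    """pinball loss; delta = actual - predicted, alpha in (0, 1)."""
    delta = y - y_pred
    return np.mean(np.where(delta > 0, alpha * delta, (alpha - 1) * delta))

# under-prediction is punished lightly when alpha is small, so the fitted
# function approaches the alpha-quantile; sklearn's
# GradientBoostingRegressor(loss="quantile") implements this directly
```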
1.11.2. complementarity
1.11.2.1. rejection learning
rejection learning - allow models to reject (not make a prediction) when they are not sufficiently confident (chow 1957, cortes et al. 2016)
To Trust Or Not To Trust A Classifier (jiang, kim, et al. 2018) - find a trusted region of points based on nearest-neighbor density (in some embedding space)
trust score uses density over some set of nearest neighbors
do clustering for each class - trust score = distance to the predicted class's cluster vs. the other classes' clusters (sketch below)
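a rough sketch of a trust-score-style ratio, assuming points live in some embedding space; this skips the paper's density-based filtering of each class's points:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def trust_score(X_train, y_train, x, pred_class, k=5):
    """ratio of distance to the nearest other class vs. the predicted class."""
    avg_dist = {}
    for c in np.unique(y_train):
        nn = NearestNeighbors(n_neighbors=k).fit(X_train[y_train == c])
        avg_dist[c] = nn.kneighbors(x.reshape(1, -1))[0].mean()
    d_other = min(d for c, d in avg_dist.items() if c != pred_class)
    return d_other / avg_dist[pred_class]  # high = prediction looks trustworthy
```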
1.11.2.2. complementarity
complementarity - ML should focus on points hard for humans + seek human input on points hard for ML
note: goal of perception isn't to learn categories but to learn things that are associated with actions
Predict Responsibly: Improving Fairness and Accuracy by Learning to Defer (madras et al. 2018) - adaptive rejection learning - builds on rejection learning by considering the strengths/weaknesses of humans
Learning to Complement Humans (wilder et al. 2020) - 2 approaches for how to incorporate human input:
discriminative approach - jointly train a predictive model and a policy for deferring to the human (with a cost for deferring) - rough sketch after this list
decision-theoretic approach - train the predictive model + policy jointly based on value of information (VOI)
do real-world experiments w/ humans to validate: scientific discovery (a galaxy classification task) and medical diagnosis (detection of breast cancer metastasis)
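a rough torch sketch of the flavor of the discriminative deferral objective (not the papers' exact formulations); `human_correct` is a hypothetical 0/1 tensor recording whether the human gets each point right:

```python
import torch

def defer_loss(model_probs, defer_prob, y, human_correct, defer_cost=0.1):
    """expected loss when deferring to a human with probability defer_prob."""
    # model's cross-entropy if it answers the query itself
    model_nll = -torch.log(model_probs[torch.arange(len(y)), y] + 1e-8)
    # human's 0/1 error if we defer, plus a fixed cost for asking
    human_loss = (1.0 - human_correct) + defer_cost
    # expectation over the (per-example) deferral policy
    return ((1 - defer_prob) * model_nll + defer_prob * human_loss).mean()
```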
Gaining Free or Low-Cost Transparency with Interpretable Partial Substitute (wang, 2019) - given a black-box model, find a subset of the data for which predictions can be made using a simple rule list (tong wang has a few papers like this)
Interpretable Companions for Black-Box Models (pan, wang, et al. 2020) - offer an interpretable, but slightly less accurate, model for each decision
human experiment evaluates how much accuracy loss humans are willing to tolerate in exchange for interpretability
1.11.3. outlier detection
overview from sklearn (usage sketch at the end of this list)
elliptic envelope - assume data is Gaussian and fit an elliptic envelope (maybe robustly) to tell when a point is an outlier
local outlier factor (breunig et al. 2000) - score based on nearest-neighbor density
idea: gradients should be larger if you are on the image manifold
isolation forest (liu et al. 2008) - the lower the average number of random splits required to isolate a sample, the more outlying it is
one-class svm - estimates the support of a high-dimensional distribution using a kernel (2 approaches):
separate the data from the origin (with max margin between origin and points) (scholkopf et al. 2000)
find a sphere boundary around the dataset with the volume of the sphere minimized (tax & duin 2004)
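usage sketch of the four sklearn detectors above on placeholder data; all follow the same fit/predict API except LOF, which is fit_predict-only by default:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

X = np.random.RandomState(0).randn(200, 2)  # placeholder data

for det in [EllipticEnvelope(), IsolationForest(random_state=0), OneClassSVM(nu=0.1)]:
    labels = det.fit(X).predict(X)          # +1 = inlier, -1 = outlier

lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
```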
detachment index (kuenzel 2019) - based on a random forest (sketch below)
for covariate \(j\), detachment index \(d^j(x) = \sum_{i=1}^n w(x, X_i) \vert X_i^j - x^j \vert\)
\(w(x, X_i) = \underbrace{\frac{1}{T}\sum_{t=1}^{T}}_{\text{average over } T \text{ trees}} \frac{\overbrace{1\{ X_i \in L_t(x) \}}^{\text{is } X_i \text{ in the same leaf?}}}{\underbrace{\vert L_t(x) \vert}_{\text{num points in leaf}}}\) - is \(X_i\) relevant to the point \(x\)?
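a minimal sketch of the detachment index, assuming a fitted sklearn random forest whose training data is `X_train` (so every leaf of \(x\) contains at least one training point); `rf.apply` returns each point's leaf id per tree, which gives the indicator and leaf sizes above:

```python
import numpy as np

def detachment_index(rf, X_train, x, j):
    """detachment of point x along covariate j, following the formula above."""
    train_leaves = rf.apply(X_train)           # (n, T) leaf ids of training points
    x_leaves = rf.apply(x.reshape(1, -1))[0]   # (T,) leaf ids of x
    same_leaf = train_leaves == x_leaves       # 1{X_i in L_t(x)}
    leaf_sizes = same_leaf.sum(axis=0)         # |L_t(x)| for each tree
    w = (same_leaf / leaf_sizes).mean(axis=1)  # average over T trees
    return np.sum(w * np.abs(X_train[:, j] - x[j]))
```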
1.11.4. bayesian approaches
epistemic uncertainty - uncertainty in the DNN model parameters
without good estimates of this, often get aleatoric uncertainty wrong (since \(p(y \vert x) = \int p(y \vert x, \theta)\, p(\theta \vert \text{data})\, d\theta\))
aleatoric uncertainty - inherent and irreducible data noise (e.g. features contradict each other)
this can usually be captured by predicting a distribution \(p(y \vert x)\) instead of a point estimate
ex. logistic regression already does this
ex. regression - just predict the mean and variance of a Gaussian (sketch below)
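a minimal torch sketch of the regression case: one net outputs a mean and a log-variance, trained with the Gaussian negative log-likelihood (the architecture and shapes are placeholders):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))

def gaussian_nll(out, y):
    # predicting log-variance keeps the variance positive without constraints
    mu, log_var = out[:, 0], out[:, 1]
    return (0.5 * log_var + 0.5 * (y - mu) ** 2 / log_var.exp()).mean()

x, y = torch.randn(32, 10), torch.randn(32)
gaussian_nll(net(x), y).backward()
```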
1.11.5. neural networks
1.11.5.1. directly predict uncertainty
Inhibited Softmax for Uncertainty Estimation in Neural Networks (mozejko et al. 2019) - directly predict uncertainty by adding an extra output during training
Learning Confidence for Out-of-Distribution Detection in Neural Networks (devries et al. 2018) - predict both a prediction \(p\) and a confidence \(c\)
during training, learn using \(p' = c \cdot p + (1 - c) \cdot y\) (sketch below)
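a rough torch sketch of that training signal: interpolate the prediction toward the true label in proportion to \(1 - c\), and penalize low confidence so the model doesn't always ask for hints (the weight `lam` is a made-up hyperparameter):

```python
import torch

def confidence_loss(p, c, y_onehot, lam=0.5):
    """p: (B, K) softmax outputs; c: (B, 1) confidence in (0, 1)."""
    p_hint = c * p + (1 - c) * y_onehot  # p' from above: low c pulls p toward label
    nll = -(y_onehot * torch.log(p_hint + 1e-8)).sum(dim=1).mean()
    return nll - lam * torch.log(c + 1e-8).mean()  # cost for requesting hints
```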
Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers (geifman et al. 2019)
just predicting uncertainty is biased
estimate uncertainty of highly confident points using earlier snapshots of the trained model
Contextual Outlier Interpretation (liu et al. 2018) - describe outliers with 3 things: an outlierness score, the attributes that contribute to the abnormality, and a contextual description of its neighborhood
Getting a CLUE: A Method for Explaining Uncertainty Estimates (antorán et al. 2020)
1.11.5.2. nearest-neighbor methods
Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning (papernot & mcdaniel, 2018)
distance-based confidence scores (mandelbaum et al. 2017) - use either distance in an embedding space or adversarial training to get uncertainties for DNNs
deep kernel knn (card et al. 2019) - predict labels based on a weighted sum of training instances, where weights are given by distance in an embedding space (sketch below)
adds an uncertainty estimate based on conformal methods
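a minimal sketch of the distance-weighted nearest-neighbor idea, assuming embeddings from some trained encoder (the conformal part is omitted):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_predict(emb_train, y_train, emb_test, k=10, n_classes=10):
    """weighted vote over training instances; weights from embedding distance."""
    dists, idx = NearestNeighbors(n_neighbors=k).fit(emb_train).kneighbors(emb_test)
    weights = 1.0 / (dists + 1e-8)  # closer training points count more
    votes = np.stack([(weights * (y_train[idx] == c)).sum(axis=1)
                      for c in range(n_classes)], axis=1)
    probs = votes / votes.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)  # label + confidence
```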
1.11.5.3. ensemble approaches
DNN ensemble uncertainty - predict the mean and variance w/ each network, then ensemble (don't need to do bagging; random init is enough) - sketch at the end of this list
can also use an ensemble of snapshots during training (huang et al. 2017)
alternatively, batch ensemble (wen et al. 2020) - have several rank-1 keys that index different weights hidden within one neural net
Deep Ensembles: A Loss Landscape Perspective (fort, hu, & lakshminarayanan, 2020)
different random initializations provide most diversity
samples along one path have varying weights but similar predictions
Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning - many complex ensemble approaches perform similarly to just an ensemble of a few randomly initialized DNNs
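a minimal torch sketch of a deep ensemble for regression: M randomly initialized nets each predict \((\mu, \log \sigma^2)\), combined by moment-matching the mixture (architecture is a placeholder):

```python
import torch
import torch.nn as nn

nets = [nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
        for _ in range(5)]  # random inits alone provide the diversity

def ensemble_predict(x):
    outs = torch.stack([net(x) for net in nets])      # (M, B, 2)
    mus, variances = outs[..., 0], outs[..., 1].exp()
    mu = mus.mean(0)
    # total variance = avg. aleatoric variance + variance of means (epistemic)
    var = variances.mean(0) + mus.var(0)
    return mu, var
```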
1.11.5.4. bayesian neural networks
want \(p(\theta \vert x) = \frac{p(x \vert \theta)\, p(\theta)}{p(x)}\)
\(p(x)\) is hard to compute
Bayes by backprop (blundell et al. 2015) - efficient way to train BNNs using backprop (sketch below)
instead of training a single network, train an ensemble of networks where each network's weights are drawn from a shared, learned probability distribution; unlike typical ensembles, this only roughly doubles the number of parameters, yet it trains an infinite ensemble using unbiased Monte Carlo estimates of the gradients
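a minimal torch sketch of the core trick: store \((\mu, \rho)\) per weight (roughly doubling the parameter count) and sample weights via the reparameterization trick on each forward pass; the full method also adds a KL term between the weight distribution and its prior to the loss, omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """linear layer with a learned Gaussian over weights (bias omitted for brevity)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x):
        sigma = F.softplus(self.rho)                   # keep the std positive
        w = self.mu + sigma * torch.randn_like(sigma)  # one Monte Carlo weight sample
        return F.linear(x, w)  # gradients flow to (mu, rho) via reparameterization
```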
Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision
focuses on epistemic uncertainty
could use one model to get uncertainty and another model to predict
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (gal & ghahramani, 2016)
dropout at test time gives you uncertainty (sketch below)
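a minimal torch sketch: leave dropout active at test time and use the spread of T stochastic forward passes as the uncertainty (net and shapes are placeholders):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))

def mc_dropout_predict(x, T=50):
    net.train()  # .train() keeps dropout on (.eval() would disable it)
    with torch.no_grad():
        preds = torch.stack([net(x) for _ in range(T)])
    return preds.mean(0), preds.std(0)  # predictive mean + uncertainty
```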
SWAG (maddox et al. 2019) - start with a pretrained net, then get a Gaussian distr. over the weights by training with a large constant step size
Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors (dusenberry, jerfel, et al. 2020) - BNNs scale to SGD-level performance with better calibration