uncertainty
some notes on uncertainty in machine learning, particularly deep learning
basics
 calibration  predicted probabilities should match real probabilities
 platt scaling  given trained classifier and new calibration dataset, basically just fit a logistic regression from the classifier predictions -> labels
 isotonic regression  nonparametric, requires more data than platt scaling
 piecewise-constant non-decreasing function instead of logistic regression
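A minimal sketch of the isotonic fit, assuming the calibration labels have already been sorted by increasing classifier score. It uses the standard pool-adjacent-violators procedure in plain Python; the function name and the toy labels are hypothetical, and this is not sklearn's implementation.

```python
def isotonic_fit(values, weights=None):
    """Pool-adjacent-violators: least-squares non-decreasing fit to values."""
    if weights is None:
        weights = [1.0] * len(values)
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the non-decreasing constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, n1 + n2])
    # Expand blocks back to one fitted value per input point.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical calibration set: 0/1 labels ordered by increasing classifier score.
labels_sorted_by_score = [0, 0, 1, 0, 1, 1, 0, 1, 1]
calibrated = isotonic_fit(labels_sorted_by_score)
```

Each entry of `calibrated` is a piecewise-constant estimate of the true positive rate at that score, which is why this needs more calibration data than the two-parameter logistic fit.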
 confidence  can be measured by the max predicted probability, the margin, or the entropy of predicted probabilities across classes
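The three confidence measures can be sketched as below. Here "margin" is interpreted as the gap between the two largest class probabilities, which is an assumption about what the note means; the function name is hypothetical.

```python
import math


def confidence_scores(probs):
    """Return (max probability, margin, entropy) for one predicted distribution."""
    ranked = sorted(probs, reverse=True)
    max_prob = ranked[0]
    margin = ranked[0] - ranked[1]  # gap between the top two classes
    entropy = -sum(p * math.log(p) for p in probs if p > 0)  # in nats
    return max_prob, margin, entropy


confident = confidence_scores([0.9, 0.05, 0.05])
uncertain = confidence_scores([0.4, 0.35, 0.25])
```

Higher max probability and margin mean more confidence, while higher entropy means less.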
 ensemble uncertainty  ensemble predictions yield uncertainty (e.g. variance within ensemble)
 quantile regression  use quantile loss to penalize models differently + get confidence intervals
 can easily do this with sklearn
 quantile loss = $\begin{cases} \alpha \cdot \Delta & \text{if} \quad \Delta > 0\\(\alpha - 1) \cdot \Delta & \text{if} \quad \Delta \leq 0\end{cases}$
 $\Delta = \text{actual} - \text{predicted}$
 Single-Model Uncertainties for Deep Learning (tagasovska & lopez-paz 2019)  use simultaneous quantile regression
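A small sketch of the quantile (pinball) loss defined above, written in plain Python rather than with sklearn. The data and the brute-force search over constant predictions are hypothetical, chosen only to show that minimizing the loss at level $\alpha$ recovers roughly the $\alpha$-quantile.

```python
def quantile_loss(actual, predicted, alpha):
    """Pinball loss for one prediction at quantile level alpha in (0, 1)."""
    delta = actual - predicted
    return alpha * delta if delta > 0 else (alpha - 1) * delta


def mean_quantile_loss(actuals, prediction, alpha):
    """Average pinball loss of a single constant prediction."""
    return sum(quantile_loss(a, prediction, alpha) for a in actuals) / len(actuals)


data = list(range(1, 101))  # hypothetical targets 1..100
# Search over constant predictions; the minimizer sits near the alpha-quantile.
best_90 = min(data, key=lambda c: mean_quantile_loss(data, c, 0.9))
best_10 = min(data, key=lambda c: mean_quantile_loss(data, c, 0.1))
```

Fitting one model at $\alpha = 0.1$ and another at $\alpha = 0.9$ gives an approximate 80% interval `[best_10, best_90]`.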
complementarity
rejection learning
 rejection learning  allow models to reject (not make a prediction) when they are not confidently accurate (chow 1957, cortes et al. 2016)
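A minimal sketch of confidence-based rejection in the spirit of Chow's rule, assuming a 0/1 misclassification cost and a fixed rejection cost in [0, 1]. The function name and the threshold values are hypothetical.

```python
def predict_or_reject(probs, reject_cost):
    """Return the argmax class, or None (reject) when confidence is too low.

    Assumes a misclassification costs 1 and a rejection costs reject_cost.
    """
    best = max(range(len(probs)), key=lambda k: probs[k])
    # Expected cost of predicting is 1 - max prob; rejecting costs reject_cost.
    if 1.0 - probs[best] > reject_cost:
        return None
    return best


decisions = [predict_or_reject(p, reject_cost=0.2)
             for p in ([0.95, 0.05], [0.55, 0.45])]
```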
 To Trust Or Not To Trust A Classifier (jiang, kim et al 2018)  find a trusted region of points based on nearest neighbor density (in some embedding space)
 trust score uses density over some set of nearest neighbors
 do clustering for each class  trust score = distance to one class’s cluster vs the other classes
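A simplified sketch of a trust-score-style ratio. It skips the density-based filtering that builds each class's trusted region and uses raw nearest-neighbor distances instead, so it is an illustration of the ratio only, not the paper's full procedure. All names and points are hypothetical.

```python
import math


def trust_score(x, train_points, train_labels, predicted_label):
    """Distance to the nearest other class divided by distance to the predicted class."""
    nearest = {}
    for point, label in zip(train_points, train_labels):
        d = math.dist(x, point)
        nearest[label] = min(d, nearest.get(label, float("inf")))
    d_pred = nearest[predicted_label]
    d_other = min(d for label, d in nearest.items() if label != predicted_label)
    return d_other / (d_pred + 1e-12)  # small constant avoids division by zero


train_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_labels = [0, 0, 1]
x = (0.1, 0.1)
score_if_pred_0 = trust_score(x, train_points, train_labels, predicted_label=0)
score_if_pred_1 = trust_score(x, train_points, train_labels, predicted_label=1)
```

A score well above 1 means the point sits much closer to the predicted class than to any other class; a score below 1 suggests the prediction should not be trusted.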
complementarity
 complementarity  ML should focus on points hard for humans + seek human input on points hard for ML
 note: goal of perception isn’t to learn categories but learn things that are associated with actions
 Predict Responsibly: Improving Fairness and Accuracy by Learning to Defer (madras et al. 2018)  adaptive rejection learning  build on rejection learning considering the strengths/weaknesses of humans
 Learning to Complement Humans (wilder et al. 2020)  2 approaches for how to incorporate human input:
 discriminative approach  jointly train predictive model and policy for deferring to human (with a cost for deferring)
 decision-theoretic approach  train predictive model + policy jointly based on value of information (VOI)
 do real-world experiments w/ humans to validate: scientific discovery (a galaxy classification task) and medical diagnosis (detection of breast cancer metastasis)
 Gaining Free or Low-Cost Transparency with Interpretable Partial Substitute (wang, 2019)  given a black-box model, find a subset of the data for which predictions can be made using a simple rule list (tong wang has a few papers like this)
 Interpretable Companions for Black-Box Models (pan, wang, et al. 2020)  offer an interpretable, but slightly less accurate model for each decision
 human experiment evaluates how much accuracy loss humans are willing to tolerate in exchange for interpretability
outlier-detection
 overview from sklearn
 elliptic envelope  assume data is Gaussian and fit elliptic envelope (maybe robustly) to tell when data is an outlier
 local outlier factor (breunig et al. 2000)  score based on nearest neighbor density
 idea: gradients should be larger if you are on the image manifold
 isolation forest (liu et al. 2008)  a lower average number of random splits required to isolate a sample means it is more likely an outlier
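The isolation idea can be sketched by following a single point through repeated random splits and counting how many are needed before it is alone. This is a simplification under stated assumptions: it does not build reusable trees, subsample, or apply the path-length normalization of the original method. Function names and data are hypothetical.

```python
import random


def isolation_depth(x, data, rng, depth=0, max_depth=12):
    """Count random axis-aligned splits until x is alone (or max_depth is hit)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    j = rng.randrange(len(x))  # pick a random feature
    lo = min(p[j] for p in data)
    hi = max(p[j] for p in data)
    if lo == hi:  # cannot split further on this feature
        return depth
    split = rng.uniform(lo, hi)  # pick a random threshold
    same_side = [p for p in data if (p[j] < split) == (x[j] < split)]
    return isolation_depth(x, same_side, rng, depth + 1, max_depth)


def average_depth(x, data, n_trees=200, seed=0):
    """Average isolation depth of x over many random split sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(x, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical data: a 5x5 grid of inliers plus one far-away point.
data = [(i * 0.1, k * 0.1) for i in range(5) for k in range(5)] + [(10.0, 10.0)]
outlier, inlier = data[-1], data[12]
outlier_depth = average_depth(outlier, data)
inlier_depth = average_depth(inlier, data)
```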
 one-class svm  estimates the support of a high-dimensional distribution using a kernel (2 approaches:)
 separate the data from the origin (with max margin between origin and points) (scholkopf et al. 2000)
 find a sphere boundary around a dataset with the volume of the sphere minimized (tax & duin 2004)
 detachment index (kuenzel 2019)  based on random forest
 for covariate $j$, detachment index $d^j(x) = \sum_{i=1}^n w(x, X_i) \vert X_i^j - x^j \vert$
 $w(x, X_i) = \underbrace{\frac{1}{T}\sum_{t=1}^{T}}_{\text{average over } T \text{ trees}} \frac{\overbrace{\mathbb{1}\{ X_i \in L_t(x) \}}^{\text{is } X_i \text{ in the same leaf?}}}{\underbrace{\vert L_t(x) \vert}_{\text{num points in leaf}}}$  is $X_i$ relevant to the point $x$?
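A sketch of the detachment index computed from precomputed leaf memberships. It assumes the leaf id of every training point and of the query point in each tree is already available (for a fitted forest these would come from the library's leaf-lookup routine); the argument layout and the toy values are hypothetical.

```python
def detachment_index(x, X, leaf_of_x, leaves_of_X):
    """Per-covariate detachment index d^j(x) from random-forest leaf memberships.

    leaf_of_x[t]      : leaf id that x falls into in tree t
    leaves_of_X[t][i] : leaf id of training point X[i] in tree t
    """
    n_trees, n = len(leaf_of_x), len(X)
    weights = [0.0] * n
    for t in range(n_trees):
        in_leaf = [i for i in range(n) if leaves_of_X[t][i] == leaf_of_x[t]]
        for i in in_leaf:
            # Indicator of sharing a leaf, divided by leaf size, averaged over trees.
            weights[i] += 1.0 / (len(in_leaf) * n_trees)
    return [sum(w * abs(Xi[j] - x[j]) for w, Xi in zip(weights, X))
            for j in range(len(x))]


X = [[0.0, 0.0], [1.0, 0.0], [10.0, 5.0]]
x = [0.5, 0.0]
leaves_of_X = [[0, 0, 1], [0, 1, 1]]  # two trees, three training points
leaf_of_x = [0, 0]
d = detachment_index(x, X, leaf_of_x, leaves_of_X)
```

A large $d^j(x)$ means the training points that the forest treats as neighbors of $x$ are far from it along covariate $j$.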
bayesian approaches
 epistemic uncertainty  uncertainty in the DNN model parameters
 without good estimates of this, often get aleatoric uncertainty wrong (since $p(y\vert x) = \int p(y \vert x, \theta) p(\theta \vert data) d\theta$)
 aleatoric uncertainty  inherent and irreducible data noise (e.g. features contradict each other)
 this can usually be gotten by predicting a distr. $p(y \vert x)$ instead of a point estimate
 ex. logistic reg. already does this
 ex. regression  just predict mean and variance of Gaussian
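For the regression case, predicting a mean and a variance is usually trained with the Gaussian negative log-likelihood. The sketch below assumes the model outputs a log-variance so the variance stays positive, which is a common convention rather than something stated in these notes.

```python
import math


def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (y - mean) ** 2 / var)


# A point with a large residual is penalized less when the model admits more noise.
nll_overconfident = gaussian_nll(y=3.0, mean=0.0, log_var=0.0)
nll_honest = gaussian_nll(y=3.0, mean=0.0, log_var=math.log(9.0))
```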
 gaussian processes
neural networks
directly predict uncertainty
 Inhibited Softmax for Uncertainty Estimation in Neural Networks (mozejko et al. 2019)  directly predict uncertainty by adding an extra output during training
 Learning Confidence for Out-of-Distribution Detection in Neural Networks (devries et al. 2018)  predict both prediction p and confidence c
 during training, learn using $p' = c \cdot p + (1 - c) \cdot y$
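A sketch of a loss built on the interpolation above. The interpolation follows the formula in the note; the extra penalty $-\lambda \log c$ and the value of `lam` are assumptions added here so the network cannot trivially set $c = 0$. Names are hypothetical, and `confidence` is assumed to lie in (0, 1].

```python
import math


def confidence_loss(probs, confidence, true_class, lam=0.1):
    """Task loss on the interpolated prediction plus a penalty for low confidence."""
    one_hot = [1.0 if k == true_class else 0.0 for k in range(len(probs))]
    # p' = c * p + (1 - c) * y: low confidence borrows a hint from the label.
    adjusted = [confidence * p + (1 - confidence) * y
                for p, y in zip(probs, one_hot)]
    task_loss = -math.log(adjusted[true_class])
    hint_penalty = -math.log(confidence)  # assumed penalty, discourages c -> 0
    return task_loss + lam * hint_penalty


# A wrong prediction: admitting low confidence reduces the total loss.
loss_sure = confidence_loss([0.1, 0.9], confidence=1.0, true_class=0)
loss_unsure = confidence_loss([0.1, 0.9], confidence=0.5, true_class=0)
```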
 Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers (geifman et al. 2019)
 just predicting uncertainty is biased
 estimate uncertainty of highly confident points using earlier snapshots of the trained model
 Contextual Outlier Interpretation (liu et al. 2018)  describe outliers with 3 things: outlierness score, attributes that contribute to the abnormality, and contextual description of its neighborhoods
 Energy-based Out-of-distribution Detection
 Getting a CLUE: A Method for Explaining Uncertainty Estimates
nearest-neighbor methods
 Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning (papernot & mcdaniel, 2018)
 distance-based confidence scores (mandelbaum et al. 2017)  use either distance in embedding space or adversarial training to get uncertainties for DNNs
 deep kernel knn (card et al. 2019)  predict labels based on weighted sum of training instances, where weights are given by distance in embedding space
 add an uncertainty based on conformal methods
ensemble approaches
 DNN ensemble uncertainty works  predict mean and variance w/ each network then ensemble (don’t need to do bagging, random init is enough)
 can also use ensemble of snapshots during training (huang et al. 2017)
 alternatively batch ensemble (wen et al. 2020)  have several rank-1 keys that index different weights hidden within one neural net
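When each ensemble member predicts a mean and a variance, one common way to combine them is to treat the ensemble as a uniform mixture of Gaussians. The sketch below assumes that combination rule; reading the two variance terms as aleatoric and epistemic is a frequent interpretation rather than something stated in these notes.

```python
def combine_gaussian_ensemble(means, variances):
    """Mean and variance of a uniform mixture of Gaussians, one per member."""
    m = len(means)
    mean = sum(means) / m
    avg_noise = sum(variances) / m  # average predicted noise
    disagreement = sum((mu - mean) ** 2 for mu in means) / m  # spread of the means
    return mean, avg_noise + disagreement


# Hypothetical predictions from two networks for the same input.
ens_mean, ens_var = combine_gaussian_ensemble(means=[1.0, 3.0], variances=[1.0, 1.0])
```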
 Deep Ensembles: A Loss Landscape Perspective (fort, hu, & lakshminarayanan, 2020)
 different random initializations provide most diversity
 samples along one path have varying weights but similar predictions
 Pitfalls of In-Domain Uncertainty Estimation and Ensembling in Deep Learning  many complex ensemble approaches are similar to just an ensemble of a few randomly initialized DNNs
bayesian neural networks
 blog posts on basics

want $p(\theta \vert x) = \frac{p(x \vert \theta) p(\theta)}{p(x)}$  $p(x)$ is hard to compute
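With a single parameter, the evidence $p(x)$ can be computed by brute force on a grid, as in this coin-flip sketch (uniform prior, hypothetical counts). The same sum needs a grid that grows exponentially with the number of parameters, which is why it is out of reach for neural nets.

```python
def grid_posterior(heads, tails, grid_size=101):
    """Posterior over a coin's bias theta on a grid, with a uniform prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size
    likelihood = [t ** heads * (1 - t) ** tails for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # p(x): a sum over every grid point
    return thetas, [u / evidence for u in unnormalized]


thetas, posterior = grid_posterior(heads=7, tails=3)
mode = thetas[posterior.index(max(posterior))]
```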

 slides on basics
 Bayes by backprop (blundell et al. 2015)  efficient way to train BNNs using backprop
 Instead of training a single network, trains an ensemble of networks, where each network has its weights drawn from a shared, learned probability distribution. Unlike other ensemble methods, the method typically only doubles the number of parameters yet trains an infinite ensemble using unbiased Monte Carlo estimates of the gradients.
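A sketch of the sampling step only, assuming each weight has a Gaussian with learned mean `mu` and standard deviation `softplus(rho)`. Training (the gradient of the variational objective) is omitted, and the toy linear "network" and parameter values are hypothetical.

```python
import math
import random
import statistics


def softplus(rho):
    """Map an unconstrained parameter to a positive standard deviation."""
    return math.log1p(math.exp(rho))


def sample_weights(mus, rhos, rng):
    """Draw one weight vector: w = mu + softplus(rho) * eps, eps ~ N(0, 1)."""
    return [mu + softplus(rho) * rng.gauss(0.0, 1.0) for mu, rho in zip(mus, rhos)]


def predict(weights, x):
    """Toy linear model standing in for a network forward pass."""
    return sum(w * xi for w, xi in zip(weights, x))


rng = random.Random(0)
mus, rhos = [1.0, -2.0], [-3.0, 0.0]  # two parameters (mu, rho) per weight
x = [1.0, 1.0]
# Each draw is one member of the implicit ensemble.
preds = [predict(sample_weights(mus, rhos, rng), x) for _ in range(1000)]
pred_mean, pred_std = statistics.mean(preds), statistics.stdev(preds)
```

The spread of `preds` reflects uncertainty in the weights; storing `mus` and `rhos` is what doubles the parameter count.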
 Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision
 icu bayesian dnns
 focuses on epistemic uncertainty
 could use one model to get uncertainty and other model to predict
 Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning
 dropout at test time gives you uncertainty
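A sketch of test-time dropout on a tiny fixed network: keep dropout active, run several stochastic forward passes, and use the spread of outputs as an uncertainty estimate. The weights, dropout rate, and number of samples are hypothetical.

```python
import random
import statistics


def forward_with_dropout(x, w_hidden, w_out, rng, drop_prob=0.5):
    """One stochastic forward pass through a one-hidden-layer ReLU network."""
    hidden = []
    for row in w_hidden:
        h = max(0.0, sum(w * xi for w, xi in zip(row, x)))  # ReLU unit
        keep = rng.random() >= drop_prob
        hidden.append(h / (1 - drop_prob) if keep else 0.0)  # inverted dropout
    return sum(w * h for w, h in zip(w_out, hidden))


def mc_dropout_predict(x, w_hidden, w_out, n_samples=200, seed=0):
    """Mean and standard deviation of the output over stochastic passes."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, w_hidden, w_out, rng)
               for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


w_hidden = [[1.0, 0.5], [-0.5, 1.0], [0.3, 0.3], [1.0, -1.0]]
w_out = [0.5, -0.2, 0.8, 0.1]
mc_mean, mc_std = mc_dropout_predict([1.0, 2.0], w_hidden, w_out)
```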
 SWAG (maddox et al. 2019)  start with pretrained net then get Gaussian distr. over weights by training with large constant step size
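A sketch of the diagonal part of this idea: collect weight vectors visited by SGD, fit a per-weight Gaussian from their first and second moments, then sample weights from it. The low-rank covariance term used by the full method is left out, and the snapshots below are hypothetical.

```python
import random


def swag_diagonal(weight_snapshots):
    """Per-weight mean and variance of the collected SGD iterates."""
    n = len(weight_snapshots)
    dim = len(weight_snapshots[0])
    mean = [sum(w[d] for w in weight_snapshots) / n for d in range(dim)]
    second_moment = [sum(w[d] ** 2 for w in weight_snapshots) / n
                     for d in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    var = [max(m2 - m ** 2, 0.0) for m2, m in zip(second_moment, mean)]
    return mean, var


def sample_swag(mean, var, rng):
    """Draw one weight vector from the fitted diagonal Gaussian."""
    return [rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]


snapshots = [[1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]  # hypothetical SGD iterates
swag_mean, swag_var = swag_diagonal(snapshots)
sampled = sample_swag(swag_mean, swag_var, random.Random(0))
```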
 Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors (dusenberry, jerfel et al. 2020)  BNNs scale to SGD-level with better calibration