some notes on uncertainty in machine learning, particularly deep learning

# 1.11. uncertainty

## 1.11.1. basics

• calibration - predicted probabilities should match real probabilities

• platt scaling - given trained classifier and new calibration dataset, basically just fit a logistic regression from the classifier predictions -> labels

• isotonic regression - nonparametric, requires more data than platt scaling

• piecewise-constant non-decreasing function instead of logistic regression
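A minimal sketch of both calibration methods, assuming scikit-learn and a synthetic dataset; the random forest base classifier and all variable names are illustrative choices, not from these notes. Note that `LogisticRegression` applies L2 regularization by default, so this is only an approximation of classic Platt scaling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): one half trains the classifier, the other calibrates it
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
scores_cal = clf.predict_proba(X_cal)[:, 1]  # uncalibrated scores on the calibration set

# Platt-style scaling: 1-D logistic regression from classifier score -> label
platt = LogisticRegression().fit(scores_cal.reshape(-1, 1), y_cal)
p_platt = platt.predict_proba(scores_cal.reshape(-1, 1))[:, 1]

# Isotonic regression: piecewise-constant, non-decreasing map from score -> probability
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(scores_cal, y_cal)
p_iso = iso.predict(scores_cal)
```

In practice the calibrated probabilities would be evaluated on a third held-out split rather than on the calibration set itself.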

• confidence - several proxies: max predicted probability, margin, entropy of predicted probabilities across classes

• ensemble uncertainty - ensemble predictions yield uncertainty (e.g. variance within ensemble)
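The confidence scores and the ensemble variance above can be sketched in NumPy. This assumes "margin" means the gap between the two largest class probabilities, and it sums the per-class variance across ensemble members as one possible disagreement score; the function names are hypothetical.

```python
import numpy as np

def confidence_scores(probs):
    """probs: (n_samples, n_classes), each row sums to 1."""
    sorted_p = np.sort(probs, axis=1)
    max_prob = sorted_p[:, -1]                   # higher = more confident
    margin = sorted_p[:, -1] - sorted_p[:, -2]   # gap between top two classes (assumed definition)
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0)), axis=1)  # higher = less confident
    return max_prob, margin, entropy

def ensemble_variance(member_probs):
    """member_probs: (n_members, n_samples, n_classes) -> (n_samples,) disagreement score."""
    return member_probs.var(axis=0).sum(axis=1)

probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25]])
max_prob, margin, entropy = confidence_scores(probs)

# Two hypothetical ensemble members that disagree on both samples
members = np.stack([probs, probs[::-1]])
disagreement = ensemble_variance(members)
```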

• quantile regression - use the quantile (pinball) loss, which penalizes over- and under-prediction asymmetrically; fitting a lower and an upper quantile gives prediction intervals
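A small sketch of the quantile (pinball) loss, assuming NumPy and synthetic Gaussian targets. To keep it short, the "model" is just a constant chosen by grid search; the helper names are hypothetical.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile loss: weight q on under-prediction, (1 - q) on over-prediction."""
    diff = y_true - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

rng = np.random.default_rng(0)
y = rng.normal(size=5000)
grid = np.linspace(-3, 3, 601)  # candidate constant predictions

def best_constant(q):
    # The constant minimizing the pinball loss approximates the q-th quantile of y
    losses = [pinball_loss(y, c, q) for c in grid]
    return grid[int(np.argmin(losses))]

q_lo, q_hi = best_constant(0.05), best_constant(0.95)
coverage = np.mean((y >= q_lo) & (y <= q_hi))  # should be close to 0.90
```

With a real model, the same loss is minimized over the model parameters once per quantile, giving an input-dependent interval.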

## 1.11.2. complementarity

### 1.11.2.1. rejection learning

• rejection learning - allow models to reject (not make a prediction) when they are not confidently accurate (chow 1957, cortes et al. 2016)

• To Trust Or Not To Trust A Classifier (jiang, kim et al 2018) - find a trusted region of points based on nearest neighbor density (in some embedding space)

• trust score uses density over some set of nearest neighbors

• do clustering for each class - trust score = distance to one class's cluster vs the other classes
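A simplified sketch of a trust-score-style ratio, assuming NumPy. It is not the paper's exact procedure: the density-based filtering of each class is skipped, and each class "cluster" is just its raw training points. The function name and toy data are hypothetical.

```python
import numpy as np

def trust_score(X_train, y_train, X_test, y_pred):
    """Ratio: distance to nearest non-predicted class / distance to predicted class.

    Values above 1 mean the point lies closer to the predicted class than to any other.
    """
    classes = np.unique(y_train)
    # dists[i, c] = distance from test point i to the nearest training point of class c
    dists = np.stack([
        np.linalg.norm(X_test[:, None, :] - X_train[y_train == c][None, :, :], axis=2).min(axis=1)
        for c in classes
    ], axis=1)
    rows = np.arange(len(X_test))
    pred_idx = np.searchsorted(classes, y_pred)  # assumes every predicted label appears in y_train
    d_pred = dists[rows, pred_idx]
    dists_other = dists.copy()
    dists_other[rows, pred_idx] = np.inf
    d_other = dists_other.min(axis=1)
    return d_other / (d_pred + 1e-12)

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5, 0.5], [0.5, 0.5]])
y_pred = np.array([0, 1])  # same point, once predicted plausibly, once implausibly
scores = trust_score(X_train, y_train, X_test, y_pred)
```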

### 1.11.2.2. complementarity

• complementarity - ML should focus on points hard for humans + seek human input on points hard for ML

• note: goal of perception isn't to learn categories but learn things that are associated with actions

• Predict Responsibly: Improving Fairness and Accuracy by Learning to Defer (madras et al. 2018) - adaptive rejection learning - build on rejection learning considering the strengths/weaknesses of humans

• Learning to Complement Humans (wilder et al. 2020) - 2 approaches for how to incorporate human input:

• discriminative approach - jointly train predictive model and policy for deferring to human (with a cost for deferring)

• decision-theoretic approach - train predictive model + policy jointly based on value of information (VOI)

• do real-world experiments w/ humans to validate: scientific discovery (a galaxy classification task) and medical diagnosis (detection of breast cancer metastasis)

• Gaining Free or Low-Cost Transparency with Interpretable Partial Substitute (wang, 2019) - given a black-box model, find a subset of the data for which predictions can be made using a simple rule-list (tong wang has a few papers like this)

## 1.11.3. outlier-detection

• overview from sklearn

• elliptic envelope - assume data is Gaussian and fit an elliptic envelope (maybe robustly) to tell when data is an outlier

• local outlier factor (breunig et al. 2000) - score based on nearest neighbor density

• idea: gradients should be larger if you are on the image manifold

• isolation forest (liu et al. 2008) - the fewer random splits needed on average to isolate a sample, the more likely it is an outlier

• one-class svm - estimates the support of a high-dimensional distribution using a kernel; 2 approaches:

• separate the data from the origin (with max margin between origin and points) (scholkopf et al. 2000)

• find a sphere boundary around a dataset with the volume of the sphere minimized (tax & duin 2004)
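The four detectors above all have scikit-learn implementations with a shared `fit_predict` interface. A minimal sketch on synthetic 2-D data; the contamination and `nu` values are arbitrary assumptions, and scikit-learn's `OneClassSVM` follows the separate-from-the-origin formulation.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_inliers = rng.normal(size=(200, 2))               # Gaussian blob around the origin
X_outliers = rng.uniform(low=6, high=8, size=(5, 2))  # a few far-away points
X = np.vstack([X_inliers, X_outliers])

detectors = {
    "elliptic envelope": EllipticEnvelope(contamination=0.05, random_state=0),
    "local outlier factor": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "isolation forest": IsolationForest(contamination=0.05, random_state=0),
    "one-class svm": OneClassSVM(nu=0.05, kernel="rbf", gamma="scale"),
}

# fit_predict returns +1 for inliers and -1 for outliers
labels = {name: det.fit_predict(X) for name, det in detectors.items()}
n_flagged = {name: int((lab == -1).sum()) for name, lab in labels.items()}
```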

• detachment index (kuenzel 2019) - based on random forest

• for covariate $$j$$, detachment index $$d^j(x) = \sum_{i=1}^n w (x, X_i) \vert X_i^j - x^j \vert$$

• $$w(x, X_i) = \underbrace{\frac{1}{T}\sum_{t=1}^{T}}_{\text{average over } T \text{ trees}} \frac{\overbrace{1\{ X_i \in L_t(x) \}}^{\text{is } X_i \text{ in the same leaf?}}}{\underbrace{\vert L_t(x) \vert}_{\text{num points in leaf}}}$$, i.e. how relevant is $$X_i$$ to the point $$x$$?
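The two formulas above can be sketched with scikit-learn's random forest, using `apply` to read off leaf memberships. One assumption to flag: leaf sizes here count all training points that land in the leaf, not only each tree's bootstrap sample. The function names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x):
    """w(x, X_i): average over trees of 1{X_i in L_t(x)} / |L_t(x)|."""
    train_leaves = forest.apply(X_train)          # (n, T) leaf id of each training point per tree
    x_leaves = forest.apply(x.reshape(1, -1))[0]  # (T,) leaf id of x per tree
    same_leaf = train_leaves == x_leaves          # (n, T) indicator: is X_i in the same leaf?
    leaf_sizes = same_leaf.sum(axis=0)            # (T,) number of training points in each leaf
    return (same_leaf / leaf_sizes).mean(axis=1)  # (n,) weights that sum to 1

def detachment_index(forest, X_train, x):
    """d^j(x) = sum_i w(x, X_i) |X_i^j - x^j| for every covariate j."""
    w = forest_weights(forest, X_train, x)
    return w @ np.abs(X_train - x)                # (p,)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 3))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=300)
forest = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=0)
forest.fit(X_train, y_train)

x_center = np.zeros(3)
x_far = np.array([10.0, 0.0, 0.0])  # far outside the training range in covariate 0
d_center = detachment_index(forest, X_train, x_center)
d_far = detachment_index(forest, X_train, x_far)
```

A large `d_far[0]` relative to `d_center[0]` indicates that the point is detached from the training data along covariate 0.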

## 1.11.4. bayesian approaches

• epistemic uncertainty - uncertainty in the DNN model parameters

• without good estimates of this, often get aleatoric uncertainty wrong (since $$p(y\vert x) = \int p(y \vert x, \theta) p(\theta \vert \text{data}) d\theta$$)

• aleatoric uncertainty - inherent and irreducible data noise (e.g. features contradict each other)

• this can usually be gotten by predicting a distr. $$p(y \vert x)$$ instead of a point estimate

• ex. logistic reg. already does this

• ex. regression - just predict mean and variance of Gaussian

• gaussian processes
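For the regression case above (predict the mean and variance of a Gaussian), a minimal NumPy sketch of the training loss and of one common way to split uncertainty. It assumes the integral over $$\theta$$ is approximated by a handful of sampled models (e.g. ensemble members), each predicting a mean and a variance, and applies the law of total variance. Function names and numbers are hypothetical.

```python
import numpy as np

def gaussian_nll(y, mean, var):
    """Average negative log-likelihood of y under N(mean, var); the loss for a mean/variance head."""
    return 0.5 * np.mean(np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def decompose_uncertainty(means, variances):
    """means, variances: (n_models, n_samples) predictions from sampled models.

    Law of total variance: Var[y] = E[var] (aleatoric) + Var[mean] (epistemic).
    """
    aleatoric = variances.mean(axis=0)  # average predicted noise
    epistemic = means.var(axis=0)       # disagreement between models about the mean
    return aleatoric, epistemic

# Two hypothetical models, two test points
means = np.array([[0.0, 1.0],
                  [2.0, 1.0]])
variances = np.array([[1.0, 1.0],
                      [1.0, 3.0]])
aleatoric, epistemic = decompose_uncertainty(means, variances)
total = aleatoric + epistemic
```

On the first point the models disagree about the mean (epistemic), while on the second they agree on the mean but predict noisy targets (aleatoric).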