# 5.10. evaluation

## 5.10.1. losses

• define a loss function $$\mathcal{L}$$

• 0-1 loss: $$\mathbb{1}[C \neq f(X)]$$ (equal to $$\vert C-f(X)\vert$$ when labels are 0/1) - hard to minimize directly (non-convex, combinatorial)

• $$L_2$$ loss: $$[C-f(X)]^2$$

• risk = $$E_{(X,Y)\sim D}[\mathcal{L}(f(X), Y)]$$

• optimal predictors

• Bayes classifier minimizes 0-1 loss: $$\hat{f}(X) = \arg\max_{C_i} P(C_i \vert X)$$

• the conditional mean $$\hat{f}(X)=E(Y\vert X)$$ minimizes $$L_2$$ loss; KNN approximates it by averaging the labels of the nearest neighbors

• classification cost functions

1. misclassification error - not differentiable

2. Gini index: $$\sum_{i \neq j} p_i p_j = \sum_i p_i (1-p_i)$$

3. cross-entropy: $$-\sum_x p(x)\: \log \: \hat p(x)$$, where $$p(x)$$ are usually labels and $$\hat p(x)$$ are softmax outputs

1. with one-hot labels, only the target-class term is nonzero (other classes are penalized implicitly through the softmax normalization)

2. for binary: $$-(p \log \hat p + (1-p) \log (1-\hat p))$$
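A minimal standard-library sketch of the quantities in this section. The helper names (`empirical_risk`, `bayes_classifier`, `knn_regress`, `gini`, `cross_entropy`, ...) are hypothetical and written only for illustration. The Bayes classifier assumes the posterior $$P(C_i \vert X)$$ is already known (rarely true in practice), and the KNN estimate is restricted to 1-D inputs for brevity.

```python
import math
from statistics import mean

# --- losses and empirical risk ---
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return float(y_pred != y_true)

def squared_loss(y_pred, y_true):
    """L2 loss: squared difference between target and prediction."""
    return (y_true - y_pred) ** 2

def empirical_risk(f, xs, ys, loss):
    """Sample average of loss(f(x), y): an estimate of the risk E[L(f(X), Y)]."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)

# --- optimal predictors (posterior assumed known; KNN as a plug-in estimate) ---
def bayes_classifier(posterior):
    """Return the class with the largest P(C_i | X); posterior maps class -> probability."""
    return max(posterior, key=posterior.get)

def knn_regress(x0, xs, ys, k):
    """Approximate E(Y | X = x0) by averaging the labels of the k nearest 1-D points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return mean(ys[i] for i in nearest)

# --- classification cost functions ---
def gini(p):
    """Gini index sum_{i != j} p_i p_j, computed via the equivalent sum_i p_i (1 - p_i)."""
    return sum(pi * (1 - pi) for pi in p)

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, p_hat):
    """-sum_x p(x) log p_hat(x); terms with p(x) = 0 are skipped since they contribute 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, p_hat) if pi > 0)

def binary_cross_entropy(p, p_hat):
    """Two-class special case: -(p log p_hat + (1 - p) log(1 - p_hat))."""
    return cross_entropy([p, 1 - p], [p_hat, 1 - p_hat])

threshold = lambda x: int(x > 0.5)
print(empirical_risk(threshold, [0.1, 0.4, 0.6, 0.9], [0, 1, 1, 1], zero_one_loss))  # 0.25
print(bayes_classifier({"cat": 0.2, "dog": 0.7, "bird": 0.1}))  # dog
print(gini([0.5, 0.5]))  # 0.5
```

With a one-hot label such as `[1, 0, 0]`, `cross_entropy` reduces to `-math.log(p_hat[0])`, which is the "only penalizes the target class" point above.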

## 5.10.2. measures

goodness of fit - how well does the learned distribution represent the real distribution?

• accuracy-based

• accuracy = (TP + TN) / (P + N)

• correct classifications / total number of test cases

• balanced accuracy = 1/2 (TP / P + TN / N)

• denominators are the total actual positives (P) and negatives (N)

• recall = sensitivity = true-positive rate = TP / P = TP / (TP + FN)

• what fraction of the real positives do we return?

• specificity = true negative rate = TN / N = TN / (TN + FP)

• what fraction of the real negatives do we correctly label as negative?

• false positive rate = FP / N $$= 1 - \text{specificity}$$

• what fraction of the real negatives are wrongly predicted positive?

• denominator is total predicted pos/neg

• precision = positive predictive value = TP / (TP + FP)

• what fraction of the predicted positives are true positives?

• negative predictive value = TN / (FN + TN)

• what fraction of predicted negatives are true negatives?

• F-score is harmonic mean of precision and recall: 2 * (prec * rec) / (prec + rec)

• NRI (net reclassification improvement, controversial): compares two models' binary predictions

• curves - easiest is often to just plot TP vs TN or FP vs FN

• ROC curve: true-positive rate (recall) vs. false-positive rate

• perfect is recall = 1 at false positive rate = 0 (top-left corner)

• precision-recall curve

• summarizing curves

• AUC: area under either of these curves, usually the ROC curve
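The accuracy-based measures and the ROC/AUC summary can be computed from scratch as below. This is a sketch assuming 0/1 labels, 0/1 predictions, and no zero denominators; `binary_metrics`, `roc_points`, and `auc` are hypothetical names. The ROC curve is traced by thresholding at every distinct score (predict positive when score >= threshold), and the area is taken with the trapezoidal rule.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def binary_metrics(y_true, y_pred):
    """Accuracy-based measures; assumes no denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    pos, neg = tp + fn, tn + fp      # actual positives P and negatives N
    recall = tp / pos                # sensitivity / true-positive rate
    specificity = tn / neg           # true-negative rate
    precision = tp / (tp + fp)       # positive predictive value
    return {
        "accuracy": (tp + tn) / (pos + neg),
        "balanced_accuracy": 0.5 * (recall + specificity),
        "recall": recall,
        "specificity": specificity,
        "fpr": fp / neg,             # = 1 - specificity
        "precision": precision,
        "npv": tn / (tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }

def roc_points(y_true, scores):
    """(FPR, TPR) pairs from sweeping the threshold over every distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under a curve given as (x, y) pairs sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

m = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 1, 1])
print(m["recall"], m["fpr"], m["precision"])                  # 0.75 0.5 0.6
print(auc(roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])))  # 0.75
```

Because the thresholds are taken in decreasing order, the FPR values come out non-decreasing, so the points are already sorted for the trapezoidal sum.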

## 5.10.3. comparing two things

• odds: $$p_1 : (1 - p_1)$$, i.e. $$\frac{p_1}{1-p_1}$$

• odds ratio: a ratio of two odds, e.g. $$\frac{p_1/(1-p_1)}{p_2/(1-p_2)}$$
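Odds and odds ratios as a tiny sketch (the helper names are hypothetical):

```python
def odds(p):
    """Odds of an event with probability p: p : (1 - p), expressed as p / (1 - p)."""
    return p / (1 - p)

def odds_ratio(p1, p2):
    """Ratio of the odds of two events, e.g. an outcome in group 1 vs. group 2."""
    return odds(p1) / odds(p2)

print(odds(0.75))             # 3.0  (3 : 1)
print(odds_ratio(0.75, 0.5))  # 3.0  (odds of 3 vs. odds of 1)
```

An odds ratio of 1 means the two groups have the same odds; note that an odds ratio of 3 is not the same as a risk ratio of 3 (here the risk ratio is 0.75 / 0.5 = 1.5).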

## 5.10.4. cv

• cross-validation - use when we don't have enough data for a separate test set

• properties

• not good when n < complexity of predictor

• because summands are correlated

• assume data units are exchangeable

• can sometimes use this to pick k for k-means

• data is reused

• types

1. k-fold - split data into k pieces

• k-1 pieces to fit the model, 1 to test

• cycle through all k folds

• average the test errors over the folds

2. leave one out (LOOCV)

• train on all but one data point and test on the held-out point

• then cycle through every point

3. random split - shuffle and repeat

4. one-way CV = prequential analysis - keep testing on the next data point, then updating the model

5. ESCV - penalize variance between folds

• regularization path of a regression - plot each coeff v. $$\lambda$$

• tells you which features get pushed to 0 and when

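• for OLS (and some other linear smoothers, e.g. ridge regression), can compute leave-one-out CV from a single fit: $$\text{LOOCV} = \frac{1}{n}\sum_i \left(\frac{e_i}{1-h_{ii}}\right)^2$$, where $$e_i$$ are the residuals and $$h_{ii}$$ the leverages (diagonal of the hat matrix)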
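A sketch of k-fold cross-validation (LOOCV is the special case k = n) alongside the single-fit LOOCV shortcut for OLS. To stay short and dependency-free it assumes simple linear regression with one feature and an intercept, where the leverage has the closed form $$h_{ii} = \frac{1}{n} + \frac{(x_i-\bar x)^2}{\sum_j (x_j - \bar x)^2}$$. All helper names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k roughly equal folds after one shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def cross_val_mse(fit, xs, ys, k):
    """Average held-out squared error over k folds; k = len(xs) gives LOOCV.

    fit(train_xs, train_ys) must return a prediction function.
    """
    fold_errors = []
    for train, test in k_fold_indices(len(xs), k):
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(sum((ys[i] - predict(xs[i])) ** 2 for i in test) / len(test))
    return sum(fold_errors) / k

def fit_line(xs, ys):
    """Simple OLS fit y = b0 + b1 * x; returns a prediction function."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    return lambda x: b0 + b1 * x

def loocv_mse_shortcut(xs, ys):
    """LOOCV for simple OLS from ONE fit: mean of (e_i / (1 - h_ii))^2."""
    n = len(xs)
    predict = fit_line(xs, ys)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = 1 / n + (x - xbar) ** 2 / sxx  # leverage h_ii (diagonal of the hat matrix)
        e = y - predict(x)                 # ordinary residual from the full fit
        total += (e / (1 - h)) ** 2
    return total / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 2.3, 2.8, 4.2, 4.9, 6.3]
print(cross_val_mse(fit_line, xs, ys, k=len(xs)))  # brute-force LOOCV (n refits)
print(loocv_mse_shortcut(xs, ys))                  # same value from a single fit
```

The two printed numbers should agree up to floating-point error; the shortcut avoids the n refits, which matters once the model is expensive to fit.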

## 5.10.5. stability

1. computational stability

• randomness in the algorithm

• perturbations to models

2. generalization stability

• perturbations to data

• sampling methods

1. bootstrap - take a sample

• repeatedly sample from observed sample w/ replacement

• each bootstrap sample has the same size as the observed sample

2. subsampling

• sample without replacement

3. jackknife resampling

• subsample containing all but one of the points
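The three resampling schemes, plus bootstrap and jackknife standard-error estimates, as a standard-library sketch (hypothetical helper names). For the sample mean, the jackknife standard error equals $$s/\sqrt{n}$$ exactly, which makes a handy sanity check; the bootstrap value is random and will only be close to it.

```python
import math
import random
import statistics

def bootstrap_samples(data, n_samples, seed=0):
    """Resample WITH replacement; each bootstrap sample has len(data) points."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_samples)]

def subsamples(data, size, n_samples, seed=0):
    """Resample WITHOUT replacement; size must be <= len(data)."""
    rng = random.Random(seed)
    return [rng.sample(data, size) for _ in range(n_samples)]

def jackknife_samples(data):
    """All n leave-one-out subsamples."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]

def bootstrap_se(data, stat, n_samples=2000, seed=0):
    """Standard deviation of the statistic across bootstrap samples."""
    reps = [stat(s) for s in bootstrap_samples(data, n_samples, seed)]
    return statistics.stdev(reps)

def jackknife_se(data, stat):
    """Jackknife standard error: sqrt((n - 1) / n * sum_i (theta_i - theta_bar)^2)."""
    n = len(data)
    reps = [stat(s) for s in jackknife_samples(data)]
    theta_bar = sum(reps) / n
    return math.sqrt((n - 1) / n * sum((r - theta_bar) ** 2 for r in reps))

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
print(jackknife_se(data, statistics.mean))  # equals stdev(data) / sqrt(n) for the mean
print(bootstrap_se(data, statistics.mean))  # close to it, but depends on the seed
```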

## 5.10.6. other considerations

• computational cost

• interpretability

• model-selection criteria

• adjusted $$R^2_p$$ - penalizes the number of parameters $$p$$

• Mallows' $$C_p$$

• $$AIC_p$$

• $$BIC_p$$

• PRESS
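A sketch of the criteria above for a Gaussian linear model, assuming the forms commonly given in regression textbooks: $$p$$ counts parameters including the intercept, $$C_p$$ uses the MSE of the largest candidate model as the noise estimate, and additive constants are dropped from AIC and BIC, so only differences between models are meaningful. PRESS is left out; it is the sum of squared leave-one-out prediction errors, $$\sum_i \left(\frac{e_i}{1-h_{ii}}\right)^2$$. The function name and argument layout are hypothetical.

```python
import math

def selection_criteria(sse_p, ssto, mse_full, n, p):
    """Criteria for a candidate linear model with p parameters (intercept included).

    sse_p    -- residual sum of squares of the candidate model
    ssto     -- total sum of squares, sum_i (y_i - ybar)^2
    mse_full -- SSE / (n - P) of the largest model considered (noise estimate for Cp)
    n        -- number of observations
    """
    return {
        # larger is better: R^2 with a penalty for extra parameters
        "adj_r2": 1 - (n - 1) / (n - p) * sse_p / ssto,
        # smaller is better; roughly p for a model with little bias
        "cp": sse_p / mse_full - (n - 2 * p),
        # smaller is better; model-independent constants dropped
        "aic": n * math.log(sse_p / n) + 2 * p,
        "bic": n * math.log(sse_p / n) + p * math.log(n),
    }

print(selection_criteria(sse_p=34.0, ssto=100.0, mse_full=2.0, n=20, p=3))
```

BIC's per-parameter penalty $$\log n$$ exceeds AIC's 2 once $$n \geq 8$$, which is why BIC tends to pick smaller models.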