evaluation
losses
 define a loss function $\mathcal{L}$
 0-1 loss: $\vert C - f(X)\vert$ - hard to minimize (combinatorial)
 $L_2$ loss: $[C - f(X)]^2$
 risk = $E_{(x,y)\sim D}[\mathcal L(f(X), y) ]$
 optimal classifiers
 Bayes classifier minimizes 0-1 loss: $\hat{f}(X)=C_i$ if $P(C_i\vert X)=\max_j P(C_j\vert X)$
 conditional mean minimizes $L_2$ loss: $\hat{f}(X)=E(Y\vert X)$ (KNN approximates this by averaging nearby points)
 classification cost functions
 misclassification error - not differentiable
 Gini index: $\sum_{i \neq j} p_i p_j = 1 - \sum_i p_i^2$
 cross-entropy: $-\sum_x p(x) \log \hat p(x)$, where $p(x)$ are usually labels and $\hat p(x)$ are softmax outputs
 only penalizes target class (others penalized implicitly because of softmax)
 for binary, $-(p \log \hat p + (1-p) \log (1-\hat p))$
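A minimal NumPy sketch of the cross-entropy formulas above, assuming one-hot labels and raw logits as input. The helper names (`softmax`, `cross_entropy`, `binary_cross_entropy`) and the small `eps` guard against `log(0)` are illustrative choices, not from the notes.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, p_hat, eps=1e-12):
    # -sum_x p(x) log p_hat(x); eps is an assumed guard against log(0).
    return -np.sum(p * np.log(p_hat + eps), axis=-1)

def binary_cross_entropy(p, p_hat, eps=1e-12):
    # -(p log p_hat + (1 - p) log(1 - p_hat))
    return -(p * np.log(p_hat + eps) + (1 - p) * np.log(1 - p_hat + eps))

logits = np.array([2.0, 1.0, 0.1])
p_hat = softmax(logits)
p = np.array([1.0, 0.0, 0.0])      # one-hot label: target is class 0
loss = cross_entropy(p, p_hat)

# With a one-hot label only the target-class term survives.
target_only = -np.log(p_hat[0])
```

Because the softmax outputs sum to 1, raising the probability of a non-target class necessarily lowers `p_hat[0]`, which is how the other classes are penalized implicitly.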
measures
goodness of fit - how well does the learned distribution represent the real distribution?
 accuracy-based
 accuracy = (TP + TN) / (P + N)
 correct classifications / total number of test cases
 balanced accuracy = 1/2 (TP / P + TN / N)
 denominator is total pos/neg
 recall = sensitivity = true-positive rate = TP / P = TP / (TP + FN)
 what fraction of the real positives do we return?
 specificity = true negative rate = TN / N = TN / (TN + FP)
 what fraction of the real negatives do we return?
 false positive rate = FP / N $= 1 - \text{specificity}$
 what fraction of the real negatives are wrongly predicted positive?
 denominator is total predicted pos/neg
 precision = positive predictive value = TP / (TP + FP)
 what fraction of the predicted positives are true positives?
 negative predictive value = TN / (FN + TN)
 what fraction of predicted negatives are true negatives?
 F-score ($F_1$) is the harmonic mean of precision and recall: 2 * (prec * rec) / (prec + rec)
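The measures above all follow from the four confusion-matrix counts. A small sketch, assuming binary 0/1 labels and at least one example in every cell's denominator (no zero-division handling); `binary_metrics` is a hypothetical helper name.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    p, n = tp + fn, tn + fp            # real positives / real negatives
    recall = tp / p                    # denominator: real positives
    specificity = tn / n               # denominator: real negatives
    precision = tp / (tp + fp)         # denominator: predicted positives
    npv = tn / (tn + fn)               # denominator: predicted negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "balanced_accuracy": 0.5 * (recall + specificity),
        "recall": recall,
        "specificity": specificity,
        "fpr": fp / n,
        "precision": precision,
        "npv": npv,
        "f1": 2 * precision * recall / (precision + recall),
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)   # TP=3, FN=1, FP=1, TN=5
```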

curves - easiest is often to just plot TP vs TN or FP vs FN
 ROC curve: true-positive rate (recall) vs. false-positive rate
 perfect is recall = 1, false positive rate = 0
 precision-recall curve
 AUC: area under (either one) of these curves - usually ROC
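A sketch of how the ROC curve and its AUC can be computed by sweeping a threshold over the classifier scores. It assumes higher scores mean "more positive" and that both classes are present; `roc_curve_points` and `auc` are illustrative names, and the area uses a plain trapezoid rule.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = int((~y_true).sum())
    fpr, tpr = [0.0], [0.0]            # threshold above every score
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t             # predict positive at or above threshold
        tpr.append(np.sum(pred & y_true) / n_pos)
        fpr.append(np.sum(pred & ~y_true) / n_neg)
    return np.array(fpr), np.array(tpr)

def auc(x, y):
    # Trapezoid rule over the curve points (x must be non-decreasing).
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2))

y_true = [0, 0, 1, 1]
fpr, tpr = roc_curve_points(y_true, [0.1, 0.4, 0.35, 0.8])
roc_auc = auc(fpr, tpr)

# Scores that separate the classes perfectly pass through (FPR=0, recall=1).
fpr_p, tpr_p = roc_curve_points(y_true, [0.1, 0.2, 0.8, 0.9])
perfect_auc = auc(fpr_p, tpr_p)
```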
comparing two things
 odds: $p_1 : (1 - p_1)$, i.e. $p_1 / (1 - p_1)$
 odds ratio is a ratio of odds
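A tiny sketch of the two definitions, assuming probabilities strictly between 0 and 1; the function names are illustrative.

```python
def odds(p):
    # Odds in favor of an event with probability p.
    return p / (1.0 - p)

def odds_ratio(p1, p2):
    # Ratio of the odds of two events (or of one event in two groups).
    return odds(p1) / odds(p2)

o = odds(0.8)               # roughly 4, i.e. odds of 4 : 1
r = odds_ratio(0.8, 0.5)    # odds(0.5) is 1, so this is also roughly 4
```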
cv
 cross validation - used when we don’t have enough data for a separate test set
 properties
 not good when n < complexity of predictor
 because summands are correlated
 assume data units are exchangeable
 can sometimes use this to pick k for k-means
 data is reused
 types
 k-fold - split data into N pieces
 N-1 pieces to fit the model, 1 to test
 cycle through all N cases
 average the values we get for testing
 leave one out (LOOCV)
 train on all the data except one point and test only on that point
 then cycle through everything
 random split - shuffle and repeat
 one-way CV = prequential analysis - keep testing on next data point, updating model
 ESCV (estimation stability with cross validation) - penalize variance between folds
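The k-fold procedure can be sketched as follows. This is a minimal version assuming a least-squares linear model and mean squared error as the test score; `k_fold_cv_mse` and the synthetic data are made up for illustration.

```python
import numpy as np

def k_fold_cv_mse(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # shuffle, then cut into k pieces
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the other k-1 pieces (ordinary least squares).
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[test] - X[test] @ beta
        scores.append(float(np.mean(resid ** 2)))
    # Average the per-fold test errors.
    return float(np.mean(scores)), scores

# Synthetic data: y = 1 + 2x + noise with standard deviation 0.5.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

cv_mse, fold_mses = k_fold_cv_mse(X, y, k=5)
```

Setting `k` equal to the number of data points gives leave-one-out CV; every point is used for both fitting and testing across folds, which is the sense in which data is reused.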

regularization path of a regression - plot each coeff v. $\lambda$
 tells you which features get pushed to 0 and when
 for OLS (and maybe other linear models), can compute leave-one-out CV without training separate models
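For OLS, the leave-one-out shortcut comes from the identity that the held-out residual for point $i$ equals $e_i / (1 - h_{ii})$, where $e_i$ is the ordinary residual and $h_{ii}$ is the $i$-th diagonal entry of the hat matrix $H = X(X^TX)^{-1}X^T$. A sketch assuming $X$ has full column rank, checked against the explicit n-refit loop; the function names are illustrative.

```python
import numpy as np

def loocv_mse_shortcut(X, y):
    # Hat matrix via the pseudoinverse: H = X (X^T X)^{-1} X^T for full-rank X.
    H = X @ np.linalg.pinv(X)
    resid = y - H @ y
    h = np.diag(H)
    # Leave-one-out residuals from a single fit.
    return float(np.mean((resid / (1.0 - h)) ** 2))

def loocv_mse_explicit(X, y):
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i           # drop point i, refit, predict it
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errs))

rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.3, size=n)

fast = loocv_mse_shortcut(X, y)
slow = loocv_mse_explicit(X, y)
```

Multiplying the shortcut value by `n` gives the sum of squared leave-one-out residuals, which is the PRESS statistic.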
stability
 computational stability
 randomness in the algorithm
 perturbations to models
 generalization stability
 perturbations to data
 sampling methods
 bootstrap - take a sample
 repeatedly sample from observed sample w/ replacement
 each bootstrap sample has the same size as the observed sample
 subsampling
 sample without replacement
 jackknife resampling
 subsample containing all but one of the points
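A sketch of the bootstrap and jackknife used to estimate the standard error of a statistic (here the sample mean, where the analytic answer $s/\sqrt{n}$ is available for comparison). The function names, the number of bootstrap replicates, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def bootstrap_se(x, stat=np.mean, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Resample with replacement; each bootstrap sample has size n.
    stats = np.array([stat(rng.choice(x, size=n, replace=True))
                      for _ in range(n_boot)])
    return float(stats.std(ddof=1))

def jackknife_se(x, stat=np.mean):
    n = len(x)
    # Each jackknife subsample leaves out exactly one point.
    stats = np.array([stat(np.delete(x, i)) for i in range(n)])
    return float(np.sqrt((n - 1) / n * np.sum((stats - stats.mean()) ** 2)))

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=50)

analytic_se = float(x.std(ddof=1) / np.sqrt(len(x)))
boot_se = bootstrap_se(x)
jack_se = jackknife_se(x)
```

For the mean, the jackknife estimate matches the analytic standard error exactly, while the bootstrap estimate agrees only up to resampling noise.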
other considerations
 computational cost
 interpretability
 model-selection criteria
 adjusted $R^2_p$ - adds a penalty for the number of predictors
 Mallows's $C_p$
 $AIC_p$
 $BIC_p$
 PRESS
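For reference, one common textbook convention for these criteria in linear regression is sketched below. The notation is an assumption, not taken from the notes: $n$ observations, a candidate model with $p$ parameters (including the intercept), $SSE_p$ its error sum of squares, $SSTO$ the total sum of squares, $MSE_{\text{full}}$ the mean squared error of the model with all candidate predictors, and $\hat Y_{i(i)}$ the prediction for point $i$ from a fit that leaves point $i$ out. Other references define $AIC$ and $BIC$ through the log-likelihood, which can differ from these by additive constants.

$$
\begin{aligned}
R^2_{a,p} &= 1 - \frac{n-1}{n-p}\,\frac{SSE_p}{SSTO} \\
C_p &= \frac{SSE_p}{MSE_{\text{full}}} - (n - 2p) \\
AIC_p &= n \ln SSE_p - n \ln n + 2p \\
BIC_p &= n \ln SSE_p - n \ln n + (\ln n)\, p \\
PRESS_p &= \sum_{i=1}^{n} \left(Y_i - \hat Y_{i(i)}\right)^2
\end{aligned}
$$

Larger adjusted $R^2_p$ is better, while smaller values are better for the other four; $BIC_p$ penalizes extra parameters more heavily than $AIC_p$ whenever $\ln n > 2$, i.e. $n \geq 8$.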