5.10. evaluation#

5.10.1. losses#

  • define a loss function \(\mathcal{L}\)

    • 0-1 loss: \(\mathbb{1}[C \neq f(X)]\) - hard to minimize directly (combinatorial)

    • \(L_2\) loss: \([C-f(X)]^2\)

  • risk = \(E_{(X,Y)\sim D}[\mathcal L(f(X), Y)]\)

  • optimal classifiers

    • Bayes classifier minimizes 0-1 loss: \(\hat{f}(X)=C_i\) if \(P(C_i\vert X)=\max_j P(C_j\vert X)\)

    • the conditional mean \(\hat{f}(X)=E(Y\vert X)\) minimizes \(L_2\) loss; KNN approximates it by averaging the \(Y\) values of nearby points

  • classification cost functions

    1. misclassification error - not differentiable

    2. Gini index: \(\sum_{i \neq j} p_i p_j = 1 - \sum_i p_i^2\)

    3. cross-entropy: \(-\sum_x p(x) \log \hat p(x)\), where \(p(x)\) is usually the label distribution and \(\hat p(x)\) the softmax outputs (sketch below)

      1. only the target-class term is nonzero for one-hot labels (other classes are penalized implicitly through the softmax normalization)

      2. for binary, \(- p \log \hat p - (1-p) \log (1-\hat p)\)
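
A minimal numpy sketch of the Gini index and cross-entropy above (function names are mine, not from a library):

```python
import numpy as np

def cross_entropy(p, p_hat, eps=1e-12):
    """Cross-entropy -sum_x p(x) log p_hat(x)."""
    p_hat = np.clip(p_hat, eps, 1.0)  # clip to avoid log(0)
    return -np.sum(p * np.log(p_hat))

def gini(p):
    """Gini index: sum_{i != j} p_i p_j = 1 - sum_i p_i^2."""
    return 1.0 - np.sum(np.asarray(p) ** 2)

# one-hot label + softmax output: only the target-class term is nonzero
p = np.array([0.0, 1.0, 0.0])
p_hat = np.array([0.1, 0.7, 0.2])
print(cross_entropy(p, p_hat))  # -log(0.7), about 0.357
print(gini([0.5, 0.5]))         # 0.5
```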

5.10.2. measures#

goodness of fit - how well does the learned distribution represent the real distribution?

  • accuracy-based

    • accuracy = (TP + TN) / (P + N)

      • correct classifications / total number of test cases

    • balanced accuracy = 1/2 (TP / P + TN / N)

  • denominator is the number of real positives / negatives

    • recall = sensitivity = true-positive rate = TP / P = TP / (TP + FN)

      • what fraction of the real positives do we return?

    • specificity = true negative rate = TN / N = TN / (TN + FP)

      • what fraction of the real negatives do we return?

    • false positive rate = FP / N \(= 1 - \text{specificity}\)

      • what fraction of the real negatives do we incorrectly flag as positive?

  • denominator is the number of predicted positives / negatives

    • precision = positive predictive value = TP / (TP + FP)

      • what fraction of the predicted positives are true positives?

    • negative predictive value = TN / (FN + TN)

      • what fraction of predicted negatives are true negatives?

  • F-score is the harmonic mean of precision and recall: 2 * (prec * rec) / (prec + rec); see the sketch after this list

  • NRI (net reclassification improvement; controversial): compares two models' binary predictions

  • curves - the simplest option is often to just plot TP vs. TN or FP vs. FN

    • roc curve: true-positive rate (recall) vs. false-positive rate

      • perfect is recall = 1, false positive rate = 0

    • precision-recall curve

    • summarizing curves

      • AUC: area under either of these curves - usually the roc curve

  • concordance = inter-rater reliability

    • exact concordance - percentage of cases where the raters are in total agreement (i.e. accuracy)

    • Cohen’s kappa coefficient - 0 is uncorrelated, 1 is perfect, negative is inverse correlation

      • \(\kappa \equiv \frac{p_{o}-p_{e}}{1-p_{e}}=1-\frac{1-p_{o}}{1-p_{e}}\)

      • \(p_o\) is relative observed agreement and \(p_e\) is chance expected agreement

    • weighted kappa coefficient - used when predicted labels are ordered (disagreements that are further apart get bigger weight)
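
A short sketch computing the confusion-matrix metrics and Cohen's kappa above from raw 0/1 labels (all function names are mine; assumes binary classes):

```python
import numpy as np

def confusion_metrics(y_true, y_pred):
    """Binary confusion-matrix metrics from 0/1 arrays."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    recall = tp / (tp + fn)        # TP / P
    specificity = tn / (tn + fp)   # TN / N
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "balanced_accuracy": (recall + specificity) / 2,
        "recall": recall,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "precision": precision,
        "f_score": 2 * precision * recall / (precision + recall),
    }

def cohens_kappa(y_true, y_pred):
    """kappa = (p_o - p_e) / (1 - p_e)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p_o = np.mean(y_true == y_pred)  # observed agreement
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c)
              for c in np.union1d(y_true, y_pred))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(confusion_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
print(cohens_kappa([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))  # = 1/3
```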

5.10.3. comparing two things#

  • odds: \(p : \text{not } p\), i.e. \(p / (1-p)\)

  • odds ratio is a ratio of two odds (worked example below)
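
A worked example with hypothetical 2x2 counts:

```python
# hypothetical 2x2 table of counts
#              outcome   no outcome
# exposed        a=20       b=80
# unexposed      c=10       d=90
a, b, c, d = 20, 80, 10, 90
odds_exposed = a / b                         # 20:80 = 0.25
odds_unexposed = c / d                       # 10:90, about 0.11
odds_ratio = odds_exposed / odds_unexposed   # = (a*d) / (b*c) = 2.25
print(odds_ratio)
```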

5.10.4. cv#

  • cross validation - useful when we don't have enough data for a separate test set

    • properties

      • not good when n < complexity of predictor

        • because summands are correlated

      • assume data units are exchangeable

      • can sometimes use this to pick k for k-means

      • data is reused

    • types

      1. k-fold - split data into \(k\) pieces

      • fit the model on \(k-1\) pieces, test on the remaining one

      • cycle through all \(k\) folds

      • average the test errors we get

      2. leave one out (LOOCV)

      • train on all but one point, test on that point

      • then cycle through every point

      3. random split - shuffle and repeat

      4. one-way CV = prequential analysis - keep testing on the next data point, updating the model

      5. ESCV - penalize variance between folds

  • regularization path of a regression - plot each coeff vs. \(\lambda\)

    • shows when features get pushed to 0

  • for OLS (and some other linear smoothers), can compute leave-one-out CV in closed form without refitting: the \(i\)th LOO residual is \(e_i / (1 - h_{ii})\), where \(h_{ii}\) is the \(i\)th diagonal entry of the hat matrix (sketch below)
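
A sketch of that closed-form LOOCV identity for OLS, checked against explicit leave-one-out refits on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design w/ intercept
y = X @ rng.normal(size=p + 1) + rng.normal(size=n)

# fit OLS once; h holds the diagonal of the hat matrix H = X (X'X)^{-1} X'
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# closed-form LOO residuals: e_i / (1 - h_ii)
loo_closed = resid / (1 - h)

# sanity check against explicit leave-one-out refits
loo_explicit = np.empty(n)
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    loo_explicit[i] = y[i] - X[i] @ b

print(np.allclose(loo_closed, loo_explicit))   # True
print("LOOCV MSE:", np.mean(loo_closed ** 2))
```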

5.10.5. stability#

  1. computational stability

  • randomness in the algorithm

  • perturbations to models

  2. generalization stability

  • perturbations to data

  • sampling methods

    1. bootstrap (sketch after this list)

    • repeatedly sample from the observed sample w/ replacement

    • each bootstrap sample has the same size as the observed sample

    2. subsampling

    • sample without replacement

    3. jackknife resampling

    • each subsample contains all but one of the points
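
A minimal sketch of bootstrap and jackknife resampling for the standard error of a mean (synthetic data; 2000 resamples is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=100)  # stand-in for the observed sample

# bootstrap: resample w/ replacement, same size as the observed sample
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(2000)
])
print("bootstrap SE of mean:", boot_means.std())
print("percentile 95% CI:", np.percentile(boot_means, [2.5, 97.5]))

# jackknife: one subsample per point, each containing all but that point
jack_means = np.array([np.delete(sample, i).mean() for i in range(len(sample))])
print("jackknife SE of mean:",
      np.sqrt((len(sample) - 1) * np.var(jack_means)))
```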

5.10.6. other considerations#

  • computational cost

  • interpretability

  • model-selection criteria (sketch after this list)

    • adjusted \(R^2_p\) - \(R^2\) with a penalty for the number of parameters

    • Mallows' \(C_p\)

    • \(AIC_p\)

    • \(BIC_p\)

    • PRESS
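
A sketch computing several of these criteria for an OLS fit, assuming Gaussian errors so that AIC/BIC reduce to the usual \(n\log(\text{RSS}/n)\) forms (up to additive constants); `selection_criteria` is my name, not a library function:

```python
import numpy as np

def selection_criteria(X, y, sigma2=None):
    """Selection scores for an OLS fit; X has n rows, p columns (incl. intercept)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    rss = resid @ resid
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # hat-matrix diagonal
    out = {
        "AIC": n * np.log(rss / n) + 2 * p,
        "BIC": n * np.log(rss / n) + p * np.log(n),
        "PRESS": np.sum((resid / (1 - h)) ** 2),  # sum of squared LOO residuals
    }
    if sigma2 is not None:  # Mallows' C_p needs a noise-variance estimate
        out["Cp"] = rss / sigma2 - n + 2 * p
    return out
```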

5.10.7. nlp#

  • BLEU (BiLingual Evaluation Understudy) - a metric for automatically evaluating machine-translated text; the score is a number between zero and one measuring the similarity of the machine-translated text to a set of high-quality reference translations (sketch below)
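
A minimal sketch using NLTK's BLEU implementation (assumes `nltk` is installed; the smoothing choice and example sentences are mine):

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

references = [["the", "cat", "sat", "on", "the", "mat"]]  # tokenized references
hypothesis = ["the", "cat", "is", "on", "the", "mat"]

# default weights average 1- through 4-gram precisions; smoothing avoids a
# zero score when some higher-order n-gram has no match
score = sentence_bleu(references, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(score)  # in [0, 1]; higher = closer to the references
```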