A checklist for model improvement


A checklist of things to try that may help improve a data-science workflow.

data splitting

  • is there any dependence between your data splits (e.g. temporal correlation) that would artificially inflate your accuracy estimates?
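For time-ordered data, the usual fix is to split chronologically rather than randomly, so the test set never leaks future information into training. A minimal sketch (the `chronological_split` helper is illustrative, not a standard API):

```python
import numpy as np

def chronological_split(X, y, test_frac=0.2):
    """Hold out the most recent rows, so no information from the
    test period leaks into training."""
    n_test = int(len(X) * test_frac)
    return X[:-n_test], X[-n_test:], y[:-n_test], y[-n_test:]

# toy time-ordered data: row i was observed at time i
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.arange(100, dtype=float)
X_tr, X_te, y_tr, y_te = chronological_split(X, y)
print(len(X_tr), len(X_te))     # 80 20
print(X_te.min() > X_tr.max())  # True: every test row is later than all training rows
```

For cross-validation, the analogous tool is a rolling/expanding-window splitter (e.g. `sklearn.model_selection.TimeSeriesSplit`).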

visualizing data

  • look at histograms of outcomes / key features
  • see if features can be easily reduced to lower dimensions (e.g. PCA, LDA)
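A quick way to check whether the features collapse onto a few directions is a minimal PCA via SVD (in practice `sklearn.decomposition.PCA` does this and more; the synthetic data here is illustrative):

```python
import numpy as np

def pca_project(X, k=2):
    """Project centered data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()  # fraction of variance per component
    return Xc @ Vt[:k].T, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # two nearly redundant features
proj, explained = pca_project(X)
print(proj.shape)                 # (200, 2)
print(explained[:2].sum() > 0.5)  # True: the first two components carry most of the variance
```

Scatter-plotting `proj` (colored by outcome) is often enough to see whether a low-dimensional structure exists.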

preprocessing

  • normalize features and outputs
  • balance the data (random sampling, random sampling + ensembling, SMOTE, etc.)
  • do feature selection with simple screening (e.g. variance, correlation)
  • do feature selection using a model (e.g. tree, lasso)
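The simple-screening step can be sketched in a few lines: drop near-constant columns, then drop one column of each highly correlated pair. The thresholds below are arbitrary placeholders; `sklearn.feature_selection` (e.g. `VarianceThreshold`) covers the library route:

```python
import numpy as np

def screen_features(X, var_thresh=1e-8, corr_thresh=0.95):
    """Drop near-constant columns, then the later column of any highly
    correlated pair; returns the surviving column indices."""
    keep = [j for j in range(X.shape[1]) if X[:, j].var() > var_thresh]
    corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    drop = set()
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            if b not in drop and corr[a, b] > corr_thresh:
                drop.add(b)
    return [keep[a] for a in range(len(keep)) if a not in drop]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0]  # perfectly correlated duplicate
X[:, 2] = 0.0      # constant column
print(screen_features(X))  # [0, 3]
```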

debugging

  • can the model achieve zero training error on a single example?
  • how do simple baselines (e.g. linear models, decision trees) perform?
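The single-example check works even for a hand-rolled training loop: a linear model fit to one point should drive training error to essentially zero, and anything else signals a bug. A toy sketch (values are arbitrary):

```python
import numpy as np

# Fit a linear model to ONE example by gradient descent on squared error.
x = np.array([0.5, -0.3, 0.8])
y = 1.7
w, b = np.zeros(3), 0.0
for _ in range(500):
    err = (w @ x + b) - y
    w -= 0.1 * 2 * err * x  # d/dw (err^2) = 2 * err * x
    b -= 0.1 * 2 * err      # d/db (err^2) = 2 * err
train_error = abs(w @ x + b - y)
print(train_error < 1e-6)  # True: failure here means the training code is broken
```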

feature engineering

  • visualize the outputs of dimensionality reduction / transformations (e.g. PCA, sparse coding, NMF) on your features
  • for correlated features, group them together or extract out a more stable feature
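One way to "extract a more stable feature" from a correlated group is simply to average it, which cancels independent noise. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)  # the underlying quantity
noisy = signal[:, None] + rng.normal(scale=1.0, size=(1000, 4))  # 4 noisy views of it
merged = noisy.mean(axis=1)     # averaging cancels the independent noise

corr_single = np.corrcoef(signal, noisy[:, 0])[0, 1]
corr_merged = np.corrcoef(signal, merged)[0, 1]
print(corr_merged > corr_single)  # True: the merged feature tracks the signal better
```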

modeling

  • try simple rules to cover most of the cases
  • try ensembling
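A minimal ensembling sketch: average the predictions of two very different weak models, here a decision stump and a least-squares line (both model choices and the synthetic data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=300)

def fit_stump(X, y):
    """Depth-1 'tree': split at the median, predict each side's mean."""
    t = np.median(X[:, 0])
    lo, hi = y[X[:, 0] <= t].mean(), y[X[:, 0] > t].mean()
    return lambda Z: np.where(Z[:, 0] <= t, lo, hi)

def fit_line(X, y):
    """Ordinary least squares with an intercept."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ coef

models = [fit_stump(X, y), fit_line(X, y)]
ensemble = np.mean([m(X) for m in models], axis=0)
print(ensemble.shape)  # (300,)
```

The same averaging works for any set of fitted models; weighting by validation performance is a common refinement.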

feature importances

  • do feature importances match what you would expect?
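Permutation importance is a model-agnostic way to run this check: shuffle one feature at a time and see how much the error rises. A hand-rolled version on synthetic data (`sklearn.inspection.permutation_importance` is the library equivalent):

```python
import numpy as np

def permutation_importance(predict, X, y, rng):
    """MSE increase when each column is shuffled; bigger = more important."""
    base = np.mean((predict(X) - y) ** 2)
    rises = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        rises.append(np.mean((predict(Xp) - y) ** 2) - base)
    return np.array(rises)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3 * X[:, 0] + 0.1 * X[:, 2]                  # feature 1 is pure noise
predict = lambda Z: 3 * Z[:, 0] + 0.1 * Z[:, 2]  # a 'model' that recovered the truth
imp = permutation_importance(predict, X, y, rng)
print(imp.argmax())  # 0: matches the expectation that feature 0 dominates
```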

analyzing errors

  • plot predictions vs. ground truth
  • try to detect outliers
  • visualize the examples with the largest prediction error
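The last two checks combine naturally: rank examples by absolute error and inspect the top of the list, where outliers and label noise tend to surface. A sketch with synthetic data and one injected bad label:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.1, size=100)  # a model with small errors
y_true[7] += 5.0                                   # inject one bad label

errors = np.abs(y_pred - y_true)
worst = np.argsort(errors)[::-1][:5]  # indices of the 5 largest errors
print(worst[0])  # 7: the corrupted example tops the list
```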