# A checklist for model improvement
A checklist of things to check that may help improve a data-science workflow.
## data splitting
- is there any dependence across your data splits (e.g. temporal correlation) that would artificially inflate your accuracy estimates?
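For example, with time-ordered data a random split lets the model peek at the future. A minimal sketch of a leakage-free split, assuming scikit-learn and time-ordered rows (the arrays here are hypothetical stand-ins for your data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # hypothetical time-ordered features
y = np.arange(100, dtype=float)    # hypothetical targets

# TimeSeriesSplit keeps every training fold strictly earlier than its test
# fold, so temporally correlated rows cannot leak from train into test
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < test_idx.min()
```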
## visualizing data
- look at histograms of outcomes / key features
- see if features can be easily reduced to lower dimensions (e.g. PCA, LDA)
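A minimal sketch of both checks, assuming scikit-learn and matplotlib (the diabetes dataset is just a stand-in for your data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

X, y = load_diabetes(return_X_y=True)

# histogram of the outcome: check for skew, outliers, or multimodality
plt.hist(y, bins=30)
plt.xlabel('outcome')
plt.ylabel('count')
plt.show()

# project the features to 2d and color by outcome to eyeball structure
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=8)
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()
```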
## preprocessing
- normalize features and the output (see the combined sketch after this list)
- balance the data (random sampling, random sampling + ensembling, SMOTE, etc.)
- do feature selection with simple screening (e.g. variance, correlation)
- do feature selection using a model (e.g. tree, lasso)
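A combined sketch of these steps, assuming scikit-learn plus the imbalanced-learn package for SMOTE (the synthetic dataset is a stand-in for yours):

```python
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

X = VarianceThreshold(threshold=1e-4).fit_transform(X)  # screen out ~constant features
X = StandardScaler().fit_transform(X)                   # normalize features
X, y = SMOTE(random_state=0).fit_resample(X, y)         # oversample the minority class
# (in a real workflow, resample inside cross-validation to avoid leakage)

# model-based selection: keep only features with nonzero l1 coefficients
l1_model = LogisticRegression(penalty='l1', solver='liblinear')
X = SelectFromModel(l1_model).fit_transform(X, y)
print(X.shape)
```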
## debugging
- can the model achieve zero training error on a single example?
- how do simple baselines (e.g. linear models, decision trees) perform?
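A minimal sketch of both debugging checks, with an MLP standing in for whatever model you are actually debugging:

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = load_diabetes(return_X_y=True)

# 1. a working model should be able to memorize one training example;
# if it cannot drive this error to ~0, the pipeline itself is likely broken
model = MLPRegressor(hidden_layer_sizes=(32,), solver='lbfgs', max_iter=5000)
model.fit(X[:1], y[:1])
print('single-example error:', abs(model.predict(X[:1])[0] - y[0]))

# 2. simple baselines set the bar any complex model must clearly beat
for baseline in (DummyRegressor(strategy='mean'), LinearRegression()):
    scores = cross_val_score(baseline, X, y, cv=5, scoring='r2')
    print(type(baseline).__name__, round(scores.mean(), 3))
```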
## feature engineering
- visualize the outputs of dim reduction / transformations (e.g. PCA, sparse coding, NMF) on your features
- for correlated features, group them together or extract a more stable feature
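One way to do the grouping, sketched with scipy's hierarchical clustering: cluster features by correlation, then replace each group with its mean (the 0.5 distance cutoff is an arbitrary illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import load_diabetes

X, _ = load_diabetes(return_X_y=True)

# distance = 1 - |correlation|, so highly correlated features end up close
corr = np.corrcoef(X, rowvar=False)
dist = squareform(1 - np.abs(corr), checks=False)
groups = fcluster(linkage(dist, method='average'), t=0.5, criterion='distance')

# replace each group of correlated features with its mean, a more stable feature
X_grouped = np.column_stack([X[:, groups == g].mean(axis=1)
                             for g in np.unique(groups)])
print(X.shape, '->', X_grouped.shape)
```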
## modeling
- try simple rules to cover most of the cases
- try ensembling
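A minimal ensembling sketch with scikit-learn's VotingRegressor; the base models are arbitrary stand-ins:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# averaging the predictions of diverse models often beats any single one
ensemble = VotingRegressor([
    ('ridge', Ridge()),
    ('rf', RandomForestRegressor(n_estimators=100, random_state=0)),
    ('gb', GradientBoostingRegressor(random_state=0)),
])
for model in (Ridge(), ensemble):
    print(type(model).__name__,
          round(cross_val_score(model, X, y, cv=5).mean(), 3))
```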
## feature importances
- do feature importances match what you would expect?
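One way to run this check is permutation importance on held-out data, sketched here with scikit-learn:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target,
                                                    random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# importance = drop in held-out score when a feature is shuffled;
# compare the resulting ranking against what domain knowledge says should matter
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f'{data.feature_names[i]}: {result.importances_mean[i]:.3f}')
```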
## analyzing errors
- plot predictions vs. ground truth
- try to detect outliers
- visualize the examples with the largest prediction error
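A minimal sketch of all three checks, assuming scikit-learn and matplotlib (the 3-sigma outlier rule is one crude choice among many):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X, y = load_diabetes(return_X_y=True)
preds = cross_val_predict(Ridge(), X, y, cv=5)

# predictions vs. ground truth: points far from the diagonal are the failures
plt.scatter(y, preds, s=8)
plt.plot([y.min(), y.max()], [y.min(), y.max()])  # y = x reference line
plt.xlabel('ground truth')
plt.ylabel('prediction')
plt.show()

# flag crude outliers (here: > 3 standard deviations from the mean outcome)
outliers = np.abs(y - y.mean()) > 3 * y.std()
print('potential outliers:', np.flatnonzero(outliers))

# inspect the examples the model gets most wrong
errors = np.abs(preds - y)
print('largest-error examples:', errors.argsort()[::-1][:10])
```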