A checklist for model improvement
A list of checks that may help improve a data-science workflow.
- is there any dependence within data splits (e.g. temporal correlations) that would artificially affect your accuracy estimates?
- look at histograms of outcomes / key features
- see if features can be easily reduced to lower dimensions (e.g. PCA, LDA)
- normalize features and outputs
- balance the data (random sampling, random sampling + ensembling, SMOTE, etc.)
- do feature selection with simple screening (e.g. variance, correlation, etc.)
- do feature selection using a model (e.g. tree, lasso)
- can the model achieve zero training error on a single example?
- how do simple baselines (e.g. linear models, decision trees) perform?
- visualize the outputs of dimensionality reduction / transformations (e.g. PCA, sparse coding, NMF) on your features
- for correlated features, group them together or extract a more stable feature
- try simple rules to cover most of the cases
- try ensembling
- do feature importances match what you would expect?
- plot predictions vs ground truth
- try to detect outliers
- visualize the examples with the largest prediction error
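
A few of the checks above (comparing against a simple baseline, comparing predictions with ground truth, and finding the examples with the largest prediction error) can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the helper names `mse`, `mean_baseline_mse`, and `largest_errors` are hypothetical, the toy numbers are made up, and a regression setting with squared error is assumed.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mean_baseline_mse(y_train, y_test):
    """Error of the trivial baseline that always predicts the training mean."""
    constant = mean(y_train)
    return mse(y_test, [constant] * len(y_test))


def largest_errors(y_true, y_pred, k=3):
    """Indices of the k examples with the largest absolute error, worst first."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    ranked = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return ranked[:k]


# Toy data, made up purely for illustration.
y_train = [1.0, 2.0, 3.0, 4.0]
y_test = [2.0, 3.0, 10.0, 4.0]
y_pred = [2.5, 2.0, 4.0, 4.5]  # pretend these came from a fitted model

print("model MSE:", mse(y_test, y_pred))
print("baseline MSE:", mean_baseline_mse(y_train, y_test))
print("worst examples:", largest_errors(y_test, y_pred, k=2))
```

If the model does not clearly beat the constant baseline, that is usually worth investigating before trying anything more complex. The indices returned by `largest_errors` are the examples to inspect or plot first, and they often overlap with outliers.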