2.10. reproducibility
Some notes on different programming techniques / frameworks for reproducibility
2.10.1. containerization
e.g. docker - packages code together with its full environment (OS, interpreter, dependencies) so runs are repeatable on other machines
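For example, a Dockerfile pins the base image and dependencies so an analysis can be rerun in the same environment; a minimal sketch (the base image, requirements.txt, and train.py entrypoint are assumptions about a typical project layout):

```dockerfile
# pin a specific interpreter version for reproducibility
FROM python:3.11-slim
WORKDIR /app

# install pinned dependencies first so this layer is cached
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# copy the project and define the default command
COPY . .
CMD ["python", "train.py"]
```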
2.10.2. data version control
dvc - data version control
.dvc folder keeps track of internal state (analogous to .git)
small metafiles ending in .dvc are stored in git; they point to big things like data and models stored elsewhere
also has simple support for tracking metrics, displaying pipelines, and making plots
restore old versions using git checkout followed by dvc checkout
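A typical command sequence (a sketch assuming dvc is installed inside a git repo; the file paths are illustrative):

```bash
# one-time setup in an existing git repo
dvc init
git commit -m "initialize dvc"

# track a large file; only the small .dvc metafile goes into git
dvc add data/train.csv
git add data/train.csv.dvc data/.gitignore
git commit -m "track training data with dvc"

# restore an old version of code + data together
git checkout <rev>
dvc checkout
```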
dagshub - built on dvc, like github (gives ~10GB free storage per project)
“Our recommendation is to separate distinct experiments (for example, different types of models) into separate branches, while smaller changes between runs (for example, changing model parameters) are consecutive commits on the same branch.”
not open source ☹️
replicate.ai - version control for ml
lightweight; focuses on tracking and sharing model weights and dependencies
less about hyperparams
mlflow (open-source) from databricks
API and UI for logging parameters, code versions, metrics and output files
gigantum - like a closed-source dagshub
codalab - good framework for reproducibility
paid / closed-source
weights and biases (free for academics, paid otherwise)
2.10.3. hyperparameter tuning
weaker (lighter-weight) options
tensorboard (mainly for deep learning)
pytorch-lightning + hydra
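At its core, what these tools automate is a search over a parameter grid plus bookkeeping of results; a stdlib-only sketch (the grid and objective are stand-ins for a real train/validate cycle):

```python
import itertools

# hypothetical search space
grid = {"lr": [0.1, 0.01], "batch_size": [32, 64]}

def objective(lr, batch_size):
    # stand-in for training a model and returning validation loss
    return (lr - 0.01) ** 2 + abs(batch_size - 64) / 1000

# exhaustive grid search: try every combination, keep the best
best_params, best_score = None, float("inf")
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = objective(**params)
    if score < best_score:
        best_params, best_score = params, score

print(best_params)  # → {'lr': 0.01, 'batch_size': 64}
```

Dedicated tuners add smarter search (random, Bayesian), parallelism, and logging of every trial on top of this loop.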
2.10.4. weights and biases
wandb.login() - log in to W&B at the start of your session
wandb.init() - initializes a new W&B run, returns a “run” object
wandb.log() - logs whatever you’d like to log
2.10.5. workflow management
tasks are basically functions
flows are used to describe the dependencies between tasks, such as their order or how they pass data around
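The task/flow split can be sketched without any framework: tasks are plain functions, and a flow is just the code that wires their outputs together (function names and data here are illustrative):

```python
# tasks: plain functions with explicit inputs and outputs
def load_data():
    return [1, 2, 3, 4]

def clean(rows):
    # keep only even values, standing in for a real filtering step
    return [r for r in rows if r % 2 == 0]

def summarize(rows):
    return sum(rows)

# flow: encodes the dependency order and how data passes between tasks
def flow():
    rows = load_data()
    cleaned = clean(rows)
    return summarize(cleaned)

print(flow())  # → 6
```

Workflow frameworks wrap such functions (typically via decorators) to add what plain Python lacks: scheduling, retries, caching, and a visual view of the dependency graph.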