reproducibility

6.11. reproducibility

dvc - data version control
- .dvc folder holds dvc's internal state (config, cache), analogous to .git
- small metafiles ending with .dvc are committed to git; each one points to a big file (data, models) that is stored outside git
- also simple support for keeping track of metrics, displaying pipeline, making plots
- go back to an old version with git checkout (restores the .dvc metafiles) followed by dvc checkout (restores the matching data / models)
- dagshub - built on dvc, like github (gives ~10GB free storage per project)
  - “Our recommendation is to separate distinct experiments (for example, different types of models) into separate branches, while smaller changes between runs (for example, changing model parameters) are consecutive commits on the same branch.”
  - not open source :frowning_face:
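
The metafile idea above can be sketched in plain Python: keep the big file out of git and commit only a small pointer that records its content hash. This is a conceptual sketch, not dvc's implementation or file format; the function names (`file_md5`, `write_pointer`, `is_unchanged`) and the `.pointer.json` suffix are hypothetical and chosen only for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small metafile next to data_path recording its hash and size.

    Hypothetical format: only this small file would be committed to git.
    """
    pointer = {
        "md5": file_md5(data_path),
        "size": data_path.stat().st_size,
        "path": data_path.name,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def is_unchanged(data_path: Path, pointer_path: Path) -> bool:
    """Return True if the data file still matches the hash in its pointer."""
    pointer = json.loads(pointer_path.read_text())
    return pointer["md5"] == file_md5(data_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"
        data.write_bytes(b"x,y\n1,2\n")
        pointer = write_pointer(data)
        print(pointer.name, is_unchanged(data, pointer))
```

Because the pointer is tiny and changes whenever the data changes, checking out an old commit recovers the old hash, which is enough to look up the matching version of the data in a cache or remote store.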
replicate.ai - version control for ml
- lightweight, focuses on tracking model weights / sharing + dependencies
- less about hyperparams
mlflow (open-source) from databricks
- API and UI for logging parameters, code versions, metrics and output files
clear-ml
gigantum - like a closed-source dagshub
codalab - good framework for reproducibility
paid / closed-source
- weights and biases (free for academics, paid otherwise)
- neptune.ai
- h2o.ai (source here)

prefect
- tasks are basically functions
- flows are used to describe the dependencies between tasks, such as their order or how they pass data around
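
The task / flow split above can be sketched with only the standard library: tasks are ordinary functions, and the flow is a table saying which task consumes which task's output. This is not prefect's API; `FLOW`, `run_flow`, and the three toy tasks are hypothetical names used for illustration, and the sketch assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter


# Tasks: plain functions.
def load():
    return [1, 2, 3]


def double(xs):
    return [2 * x for x in xs]


def total(xs):
    return sum(xs)


# Flow: task name -> (function, names of upstream tasks whose results it consumes).
FLOW = {
    "load": (load, []),
    "double": (double, ["load"]),
    "total": (total, ["double"]),
}


def run_flow(flow):
    """Run every task after its upstream tasks, passing their results as arguments."""
    graph = {name: set(deps) for name, (_, deps) in flow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = flow[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


if __name__ == "__main__":
    print(run_flow(FLOW)["total"])
```

Keeping the dependency table separate from the functions is the point of the split: the functions stay testable on their own, while the ordering and data passing live in one place.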