2.10. reproducibility

Some notes on different programming techniques / frameworks for reproducibility

2.10.1. containerization

2.10.2. data version control

  • dvc - data version control

    • .dvc folder holds dvc's internal state (config, cache), analogous to .git

    • metafiles ending with .dvc are stored in git, tracking big things like data and models

    • also simple support for keeping track of metrics, displaying pipeline, making plots

    • go back to older versions using git checkout (restores the .dvc metafiles) followed by dvc checkout (restores the matching data)

    • dagshub - built on dvc, like github (gives ~10GB free storage per project)

      • “Our recommendation is to separate distinct experiments (for example, different types of models) into separate branches, while smaller changes between runs (for example, changing model parameters) are consecutive commits on the same branch.”

      • not open source ☹

  • replicate.ai - version control for ml

    • lightweight, focuses on tracking model weights / sharing + dependencies

    • less about hyperparams

  • mlflow (open-source) from databricks

    • API and UI for logging parameters, code versions, metrics and output files

  • clear-ml

  • gigantum - like a closed-source dagshub

  • codalab - good framework for reproducibility

  • paid / closed-source
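
The dvc note above (small metafiles live in git and point at large data) can be illustrated with a toy sketch. This is not dvc's implementation: real dvc writes YAML `.dvc` files and keeps the data in its own cache. The function names, the JSON format, and the `.ptr.json` suffix below are all assumptions made for illustration only.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write a small pointer file next to the data (hypothetical format).

    The pointer is what would be committed to git; the data itself would not.
    """
    data_path = Path(data_path)
    pointer = {
        "outs": [
            {
                "md5": file_md5(data_path),
                "size": data_path.stat().st_size,
                "path": data_path.name,
            }
        ]
    }
    pointer_path = data_path.with_name(data_path.name + ".ptr.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def is_up_to_date(pointer_path):
    """Check whether the data on disk still matches the recorded hash."""
    pointer_path = Path(pointer_path)
    entry = json.loads(pointer_path.read_text())["outs"][0]
    return file_md5(pointer_path.parent / entry["path"]) == entry["md5"]
```

Because the pointer only records a hash, size, and path, it stays tiny regardless of how large the tracked file is, which is what makes committing it to git practical.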

2.10.3. hyperparameter tuning

  • weaker versions

  • pytorch-lightning + hydra

  • ray

2.10.4. weights and biases

  • wandb.login() - login to W&B at the start of your session

  • wandb.init() - initialise a new W&B run; returns a “run” object

  • wandb.log() - logs a dictionary of metrics (or other data) to the current run
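
The three calls above fit together roughly as in the sketch below. The project name `demo-project`, the `fake_loss` helper, and the `run_experiment` wrapper are assumptions for illustration. The sketch defaults to offline mode so it does not need an account; `wandb` is imported lazily so the file loads even where the package is not installed.

```python
import math


def fake_loss(step):
    """Stand-in for a real training loss: decays smoothly with the step."""
    return math.exp(-0.1 * step)


def run_experiment(num_steps=5, mode="offline"):
    """Log a short, made-up loss curve to W&B."""
    import wandb  # lazy import: only needed when the experiment actually runs

    if mode == "online":
        wandb.login()  # authenticate once at the start of the session

    # init() starts a new run and returns the run object
    run = wandb.init(
        project="demo-project",  # hypothetical project name
        config={"num_steps": num_steps},
        mode=mode,
    )
    for step in range(num_steps):
        # log() takes a dictionary of values to record for this step
        wandb.log({"loss": fake_loss(step)}, step=step)
    run.finish()  # mark the run as complete
```

Calling `run_experiment()` writes the run locally; passing `mode="online"` would instead log in and upload it.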

2.10.5. workflow management

  • prefect

    • tasks are basically functions

    • flows are used to describe the dependencies between tasks, such as their order or how they pass data around
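
A minimal sketch of the task / flow split described above, assuming the decorator-based API of Prefect 2 or later (`from prefect import flow, task`). The function names and the toy data are hypothetical. The plain functions are defined first and wrapped lazily, so they can be used and tested without prefect installed.

```python
def extract():
    """Pretend to load some raw values."""
    return [1, 2, 3]


def transform(values):
    """Scale every value by ten."""
    return [v * 10 for v in values]


def total(values):
    """Reduce the values to a single number."""
    return sum(values)


def build_flow():
    """Wrap the plain functions as tasks and wire them into a flow."""
    from prefect import flow, task  # lazy import: needs prefect installed

    extract_task = task(extract)
    transform_task = task(transform)
    total_task = task(total)

    @flow
    def etl():
        # The data passed between calls defines the dependencies:
        # transform needs extract's output, total needs transform's.
        raw = extract_task()
        scaled = transform_task(raw)
        return total_task(scaled)

    return etl
```

With prefect available, `build_flow()()` runs the three tasks in dependency order; without it, the underlying functions still compose as ordinary Python.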