useful ml / data-science packages and tips



These are some packages/tips for machine-learning coding in python.

Machine learning code gets messy fast. In contrast to other tasks, machine learning often requires a large number of variations of very similar models, which can be difficult to keep track of. Here are some tips on coding for machine learning, assuming you already know the basics (e.g. numpy, scikit-learn, etc.) and have selected an appropriate framework for your problem (e.g. pytorch):

very useful packages

  • tqdm: add a loading bar to any loop in a super easy way:
from tqdm import tqdm

for i in tqdm(range(10000)):
	...

displays

76%|████████████████████████████         | 7568/10000 [00:33<00:10, 229.00it/s]
  • h5py: a great way to read/write to arrays which are too big to store in memory, as if they were in memory
    • pyarrow: good for storing metadata as well (like dataframes)
  • slurmpy: lets you submit jobs to slurm using just python, so you never need to write bash scripts.
  • pandas: provides dataframes to python - often overlooked for big data, which might not fit into an in-memory DataFrame. Still very useful for comparing results of models, particularly with many hyperparameters.
    • modin - drop in pandas replacement to speed up operations
  • pandas-profiling - very quick data overviews
  • pandas bamboolib - ui for pandas, in development
  • dovpanda - helpful hints for using pandas
  • imodels - fitting simple models
  • tabnine - autocomplete for jupyter
  • cloud9sdk can be a useful substitute for jupyterhub
  • thinc - interoperable dl framework
  • napari - image viewer
  • python-fire - passing cmd line args
  • auto-sklearn - automatically select hyperparams / classifiers using bayesian optimization

computation

  • dask - natively scales python
  • joblib - caches intermediate computations
  • jax - high-performance python + numpy
  • numba - alternative to dask for speeding up numerical code; compiles functions just-in-time and just requires adding decorators to functions
  • some tips for using jupyter
    • useful shortcuts: tab, shift+tab: inspect something

plotting

  • matplotlib - basic plotting in python
  • animatplot - animates plots in matplotlib
  • seaborn - makes quick and beautiful plots for easy data exploration, although may not be best for final plots
  • bokeh - interactive visualization library, examples
  • plotly - make interactive plots

documenting / deploying

  • dvc - version control for data science
  • streamlit - building interactive applications
  • gradio - yields a nice web interface for getting model predictions
  • pdoc3 - can very quickly generate simple api documentation from docstrings
  • kubeflow

deep learning

  • mmdnn - converts between dl frameworks

general tips

  • installing things: using pip/conda is generally the best way to install things. If you’re running into permission errors, pip install --user tends to fix a lot of common problems
  • make classes for datasets/dataloaders: wrapping data loading/preprocessing allows your code to be much cleaner and more modular. It also lets your models easily be adapted to different datasets. Pytorch has a good tutorial on how to do this (although the same principles apply without using pytorch).
  • store hyperparameters: when you test many different sets of hyperparameters, it is difficult to map which hyperparameters correspond to which results. It’s important to store hyperparameters in an easily readable way, such as saving an argparse object, or storing/saving parameters in a class you define yourself.
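
As a rough sketch of the dataset-class tip above, without pytorch: the class below keeps loading and preprocessing in one place and exposes only __len__ and __getitem__, which is the same interface a pytorch Dataset uses. The class name CsvDataset, the column names, and the toy data are all made up for illustration.

```python
import csv
import io


class CsvDataset:
    """Hypothetical dataset wrapper: loading and preprocessing live in one place."""

    def __init__(self, file_obj, label_column, transform=None):
        rows = list(csv.DictReader(file_obj))
        # Split the label column off from the feature columns.
        self.labels = [row.pop(label_column) for row in rows]
        # Preprocessing: convert every remaining column to float.
        self.features = [[float(v) for v in row.values()] for row in rows]
        self.transform = transform

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x = self.features[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[idx]


# Toy in-memory CSV standing in for a real file on disk.
toy_csv = io.StringIO("height,width,label\n1.0,2.0,cat\n3.0,4.0,dog\n")
dataset = CsvDataset(toy_csv, label_column="label",
                     transform=lambda x: [v / 10 for v in x])

for i in range(len(dataset)):
    x, y = dataset[i]
    print(x, y)
```

Swapping in a different dataset then only means writing another class with the same two methods; the training code does not need to change.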

environment

  • it’s hard to pick a good ide for data science. jupyter notebooks are great for exploratory analysis, while more fully built ides like pycharm are better for large-scale projects
  • using atom with the hydrogen plugin often strikes a nice balance
  • jupytext offers a nice way to use version control with jupyter
  • when working on AWS, this command is useful for starting remote jupyterlab sessions: screen jupyter lab --certfile=~/ssl/mycert.pem --keyfile ~/ssl/mykey.key

hyperparameter tracking

  • it can often be messy to keep track of ml experiments
  • often I like to create a class which is basically a dictionary for params I want to vary / save and then save those to a dict (ex here), but this only works for relatively small projects
  • trains by allegroai seems to be a promising experiment manager
  • reddit thread detailing different tracking frameworks
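
One possible version of the "class which is basically a dictionary" idea, using only the standard library. This is a sketch, not the code behind the "ex here" link: the Params fields, the file-naming scheme, and the result value are placeholders chosen for illustration.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class Params:
    """Hypothetical hyperparameters for one experiment run."""
    lr: float = 0.01
    batch_size: int = 32
    optimizer: str = "sgd"
    seed: int = 0


def save_run(params, results, out_dir):
    """Write params and results side by side in a single JSON file."""
    os.makedirs(out_dir, exist_ok=True)
    # Name the file after the varied settings so runs are easy to tell apart.
    fname = f"lr={params.lr}_bs={params.batch_size}_seed={params.seed}.json"
    path = os.path.join(out_dir, fname)
    with open(path, "w") as f:
        json.dump({"params": asdict(params), "results": results}, f, indent=2)
    return path


def load_run(path):
    """Read a run back as a (Params, results) pair."""
    with open(path) as f:
        record = json.load(f)
    return Params(**record["params"]), record["results"]


out_dir = tempfile.mkdtemp()
p = Params(lr=0.1, batch_size=64)
path = save_run(p, {"val_acc": 0.9}, out_dir)  # placeholder result, not a real score
loaded_params, loaded_results = load_run(path)
```

If the settings come from argparse instead, vars(args) gives a plain dict that can be dumped to JSON the same way.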

command line utils

  • ctrl-a: HOME, ctrl-e: END
  • git add folder/\*.py

interpretability cheat-sheets

vim shortcuts

  • remap esc -> jk
  • J - join the current line with the next one (removes the line break and extra whitespace)
  • ctrl+r - redo
  • o - new empty line below (and enter insert mode)
  • O - new empty line above (and enter insert mode)
  • e - move to end of word
  • use h and l!
  • r - replace command
  • R - enter replace mode (deletes as it inserts)
  • c - change operator (basically delete and then insert)
    • use with position (ex. ce)
  • W, B, gE, E - move by white-space separated words
  • ctrl+o - go back, end search
  • % - matching parenthesis
    • can find and substitute
  • vim extensive
    • ! ls

packaging projects

  • good reference
    • python setup.py sdist bdist_wheel
    • twine upload dist/*

misc services

  • google buckets: just type gsutil followed by a linux-style command (e.g. gsutil ls, gsutil du)

tmux shortcuts

  • remap so clicking/scrolling works
  • use ctrl+b (the default prefix key) before tmux commands
  • tmux ls

installation

  • pip install --user

sharing

data

  • cool analysis / data from BuzzFeed here

reference