Useful ml / data-science packages and tips


These are some packages/tips for machine-learning coding in python.

Machine learning code gets messy fast. In contrast to other tasks, machine learning often requires a large number of variations of very similar models, which can be difficult to keep track of. Here are some tips on coding for machine learning, assuming you already know the basics (e.g. numpy, scikit-learn) and have selected an appropriate framework for your problem (e.g. pytorch):

very useful packages

  • tqdm: add a progress bar to any loop in a super easy way:
from tqdm import tqdm

for i in tqdm(range(10000)):
	...

displays

76%|████████████████████████████         | 7568/10000 [00:33<00:10, 229.00it/s]
  • h5py: a great way to read/write arrays which are too big to fit in memory, as if they were in memory (minimal sketch at the end of this list)
    • pyarrow: good for storing metadata as well (like dataframes)
  • slurmpy: lets you submit jobs to slurm using just python, so you never need to write bash scripts.
  • pandas: provides dataframes for python - often dismissed for big data that won’t fit into memory, but still very useful for comparing results of models, particularly across many hyperparameters.
    • modin - drop-in pandas replacement that speeds up operations
    • df.to_latex() exports a dataframe as a LaTeX table
  • pandas-profiling - very quick data overviews
  • pandas bamboolib - ui for pandas, in development
  • dovpanda - helpful hints for using pandas
  • imodels - fitting simple models
  • tabnine - autocomplete for jupyter
  • cloud9sdk can be a useful substitute for jupyterhub
  • thinc - interoperable dl framework
  • napari - image viewer
  • python-fire - automatically generates command-line interfaces from python objects
  • auto-sklearn - automatically select hyperparams / classifiers using bayesian optimization
  • venv - manage your python packages (or something similar like pipenv)
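
As a minimal h5py sketch (the filename, dataset name, and shapes here are just placeholders), you can write an array to disk in chunks and read back only the slice you need:

import h5py
import numpy as np

# create a file-backed array (filename / shape are arbitrary)
with h5py.File("big_array.h5", "w") as f:
    dset = f.create_dataset("X", shape=(100000, 1000), dtype="float32")
    dset[:1000] = np.random.randn(1000, 1000)  # write one chunk at a time

# later, read just the slice you need -- the full array never enters memory
with h5py.File("big_array.h5", "r") as f:
    batch = f["X"][:256]  # numpy array of shape (256, 1000)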

computation

  • dask - natively scales python
  • joblib - caches intermediate computations (minimal sketch at the end of this list)
  • jax - high-performance python + numpy
  • numba - speeds up numerical python functions via JIT compilation, just requires adding decorators to functions
  • some tips for using jupyter
    • useful shortcuts: tab for autocompletion, shift+tab to inspect a function’s signature / docstring
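
A small sketch of joblib caching (the cache directory and function here are made up): decorating a slow function with memory.cache stores its results on disk, so repeat calls with the same arguments skip the computation:

from joblib import Memory

memory = Memory("cache_dir", verbose=0)  # results get pickled to ./cache_dir

@memory.cache
def expensive_features(fname):
    # stand-in for a slow preprocessing step
    ...

# the first call computes and stores the result; later calls with the
# same arguments load it from disk instead of recomputing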

plotting

  • matplotlib - basic plotting in python
  • animatplot - animates plots in matplotlib
  • seaborn - makes quick and beautiful plots for easy data exploration, although may not be best for final plots (quick sketch after this list)
  • bokeh - interactive visualization library (see its gallery of examples)
  • plotly - make interactive plots
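
A quick seaborn sketch (the toy dataframe here is made up) - one call gives a reasonable exploratory plot:

import numpy as np
import pandas as pd
import seaborn as sns

# toy results standing in for real model outputs
df = pd.DataFrame({"accuracy": np.random.rand(50),
                   "model": np.random.choice(["lasso", "mlp"], 50)})
sns.boxplot(data=df, x="model", y="accuracy")  # compare distributions across models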

documenting / deploying

  • dvc - version control for data science
  • streamlit - building interactive applications
  • gradio - yields a nice web interface for getting model predictions (minimal sketch after this list)
  • pdoc3 - can very quickly generate simple api docs from docstrings
  • kubeflow - machine-learning pipelines on kubernetes
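
A minimal gradio sketch (the predict function here is a placeholder for a real model):

import gradio as gr

def predict(text):
    # stand-in for a real model's prediction function
    return text[::-1]

# wraps the function in a small web ui with a textbox for input and output
gr.Interface(fn=predict, inputs="text", outputs="text").launch()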

deep learning

  • mmdnn - converts between dl frameworks

general tips

  • installing things: pip/conda is generally the best way to install packages. If you’re running into permission errors, pip install --user tends to fix a lot of common problems (best practice is to use a separate virtualenv or pipenv for each project)
  • make classes for datasets/dataloaders: wrapping data loading/preprocessing allows your code to be much cleaner and more modular. It also lets your models easily be adapted to different datasets. Pytorch has a good tutorial on how to do this (although the same principles apply without using pytorch); see the minimal sketch after this list
  • store hyperparameters: when you test many different sets of hyperparameters, it quickly becomes difficult to map which hyperparameters correspond to which results. It’s important to store hyperparameters in an easily readable way, such as saving an argparse object, or storing/saving parameters in a class you define yourself (sketch after this list)
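
A minimal sketch of wrapping data in a pytorch Dataset (the class name and arrays here are placeholders):

import torch
from torch.utils.data import Dataset, DataLoader

class TabularDataset(Dataset):
    """Keeps loading/preprocessing in one place so models can swap datasets easily."""

    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# swapping datasets now only means constructing a different Dataset object
# loader = DataLoader(TabularDataset(X, y), batch_size=64, shuffle=True)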
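
And a sketch of saving an argparse object alongside results (the directory layout here is just one possible convention):

import argparse
import json
import os

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--seed", type=int, default=0)
args = parser.parse_args()

# save the hyperparameters next to the results they produced
out_dir = f"results/lr={args.lr}_seed={args.seed}"
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "params.json"), "w") as f:
    json.dump(vars(args), f, indent=2)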

environment

  • vscode (with jupyter support) is the best ide for data science
    • it’s hard to pick a good ide for data science. jupyter notebooks are great for exploratory analysis, while more fully built ides like pycharm or vscode are better for large-scale projects
    • using atom with the hydrogen plugin often strikes a nice balance (sadly no longer maintained 😢)
  • github copilot is a nice add-in
  • jupytext offers a nice way to use version control with jupyter
  • when working on AWS, this command is useful for starting remote jupyterlab sessions: screen jupyter lab --certfile=~/ssl/mycert.pem --keyfile ~/ssl/mykey.key

hyperparameter tracking

  • it can often be messy to keep track of ml experiments
  • often I like to create a class which is basically a dictionary for params I want to vary / save and then save those to a dict (ex here), but this only works for relatively small projects
  • trains by allegroai seems to be a promising experiment manager
  • reddit thread detailing different tracking frameworks

command line utils

  • ctrl-a: HOME, ctrl-e: END
  • git add folder/\*.py

interpretability cheat-sheets

presenting

vim shortcuts

  • remap esc -> jk
  • J - remove whitespace
  • ctrl+r - redo
  • o - open a new empty line below (and enter insert mode)
  • O - open a new empty line above (and enter insert mode)
  • e - move to end of word
  • use h and l!
  • r - replace command
  • R - enter replace mode (deletes as it inserts)
  • c - change operator (basically delete and then insert)
    • use with a motion (ex. ce)
  • W, B, gE, E - move by white-space separated words
  • ctrl+o - go back, end search
  • % - jump to the matching parenthesis
    • :%s/old/new/g - find and substitute across the file
  • vim extensive
    • :! - run a shell command from inside vim (ex. :!ls)

packaging projects

  • good reference (a minimal setup.py sketch follows these commands)
    • python setup.py sdist bdist_wheel
    • twine check dist/*
    • twine upload dist/*
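
The commands above assume a setup.py along these lines (the name and metadata are placeholders):

# setup.py
from setuptools import setup, find_packages

setup(
    name="mypackage",            # placeholder package name
    version="0.0.1",
    packages=find_packages(),
    install_requires=["numpy"],  # list real dependencies here
)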

misc services

  • magic wormhole: easily send files between computers
  • google buckets: prefix normal linux-style commands with gsutil (e.g. gsutil ls, gsutil du)

tmux shortcuts

  • remap so clicking/scrolling works (enable mouse mode)
  • ctrl+b is the prefix key for tmux commands
  • tmux ls - list running sessions

installation

  • pip install --user

sharing

data

  • cool analysis / data from BuzzFeed here

js demo libraries

reference