# Useful ML / data-science packages and tips
These are some packages/tips for machine-learning coding in python.
Machine learning code gets messy fast. In contrast to other tasks, machine learning often requires a large number of variations of very similar models, which can be difficult to keep track of. Here are some tips on coding for machine learning, assuming you already know the basics (e.g. numpy, scikit-learn, etc.) and have selected an appropriate framework for your problem (e.g. pytorch):
## very useful packages
- tqdm: add a loading bar to any loop in a super easy way:

  ```python
  from tqdm import tqdm

  for i in tqdm(range(10000)):
      ...
  ```

  displays

  ```
  76%|████████████████████████████ | 7568/10000 [00:33<00:10, 229.00it/s]
  ```
- h5py: a great way to read/write arrays that are too big to fit in memory, as if they were in memory (see the sketch after this list)
- pyarrow: good for storing metadata as well (like dataframes)
- slurmpy: lets you submit jobs to slurm using just python, so you never need to write bash scripts.
- pandas: provides dataframes to python - often overlooked for big data that won't fit into memory, but still very useful for comparing results of models, particularly with many hyperparameters
- modin - drop-in pandas replacement to speed up operations
- `df.to_latex()` - export a dataframe as a LaTeX table
- pandas-profiling - very quick data overviews
- bamboolib - a UI for pandas, in development
- dovpanda - helpful hints for using pandas
- imodels - fitting simple models
- tabnine - autocomplete for jupyter
- cloud9 sdk can be a useful substitute for jupyterhub
- thinc - interoperable dl framework
- napari - image viewer
- python-fire - automatically generate command-line interfaces from python objects
- auto-sklearn - automatically select hyperparams / classifiers using bayesian optimization
- venv - manage your python packages (or something similar like pipenv)
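As referenced above for h5py, here is a minimal sketch of out-of-core reads/writes; the file name, dataset name, and shapes are just placeholders:

```python
import h5py
import numpy as np

# write a large array to disk without holding it all in memory
with h5py.File("data.h5", "w") as f:
    dset = f.create_dataset("activations", shape=(100_000, 512), dtype="float32")
    dset[:1000] = np.random.randn(1000, 512)  # write just one slice

# later, read back only the rows you need
with h5py.File("data.h5", "r") as f:
    chunk = f["activations"][:100]  # only these rows are loaded into memory
```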
## computation

- dask - natively scales python
- joblib - caches intermediate computations (see the sketch after this list)
- jax - high-performance python + numpy
- numba - speeds up numerical python via JIT compilation, just requires adding decorators to functions
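A small joblib caching sketch, as mentioned above; the cache directory and the function are illustrative:

```python
from joblib import Memory

memory = Memory("./cachedir", verbose=0)  # cached results are persisted here

@memory.cache
def expensive_preprocessing(n):
    # stand-in for a slow computation
    return sum(i ** 2 for i in range(n))

expensive_preprocessing(10_000_000)  # computed and written to the cache
expensive_preprocessing(10_000_000)  # loaded from the cache on the repeat call
```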
## some tips for using jupyter

- useful shortcuts: `tab`, `shift+tab`: inspect something
## plotting

- matplotlib - basic plotting in python
- animatplot - animates plots in matplotlib
- seaborn - makes quick and beautiful plots for easy data exploration, although it may not be best for final plots (see the sketch after this list)
- bokeh - interactive visualization library with a nice gallery of examples
- plotly - make interactive plots
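A quick seaborn sketch of the kind of one-liner exploration mentioned above (uses the iris example dataset bundled with seaborn):

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("iris")    # small example dataframe shipped with seaborn
sns.pairplot(df, hue="species")  # pairwise scatterplots colored by class
plt.show()
```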
## documenting / deploying

- dvc - version control for data science
- streamlit - building interactive applications
- gradio yields a nice web interface for getting model predictions (see the sketch after this list)
- pdoc3 can very quickly generate a simple API reference from docstrings
- kubeflow - machine-learning workflows on kubernetes
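A minimal gradio sketch, as referenced above; the `predict` function is just a stand-in for a real model:

```python
import gradio as gr

def predict(text):
    # stand-in for a real model; returns a dummy label
    return "positive" if "good" in text.lower() else "negative"

# launches a local web UI with a text input and a text output
gr.Interface(fn=predict, inputs="text", outputs="text").launch()
```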
## deep learning
- mmdnn - converts between dl frameworks
## general tips

- installing things: using pip/conda is generally the best way to install things. If you're running into permission errors, `pip install --user` tends to fix a lot of common problems (best practice is to use a separate `virtualenv` or `pipenv` for each project)
- make classes for datasets/dataloaders: wrapping data loading/preprocessing allows your code to be much cleaner and more modular. It also lets your models easily be adapted to different datasets. Pytorch has a good tutorial on how to do this (although the same principles apply without using pytorch); see the sketch after this list
- store hyperparameters: when you test many different sets of hyperparameters, it quickly becomes difficult to map which hyperparameters correspond to which results. It's important to store hyperparameters in an easily readable way, such as saving an argparse object, or storing/saving parameters in a class you define yourself
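A minimal pytorch Dataset/DataLoader sketch of the wrapping idea above; `X_train` and `y_train` are assumed to be arrays you already have, and the preprocessing is a placeholder:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """Wraps loading + preprocessing so models can be swapped across datasets."""

    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # any per-example preprocessing / augmentation would go here
        return self.X[idx], self.y[idx]

loader = DataLoader(ArrayDataset(X_train, y_train), batch_size=32, shuffle=True)
for xb, yb in loader:
    ...  # training step
```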
## environment

- vscode (with jupyter support) is the best ide for data science
- github copilot is a ~~nice~~ critical add-in
- jupytext offers a nice way to use version control with jupyter
- when working on AWS, this command is useful for starting remote jupyterlab sessions

  ```bash
  screen jupyter lab --certfile=~/ssl/mycert.pem --keyfile ~/ssl/mykey.key
  ```
## hyperparameter tracking

- it can often be messy to keep track of ml experiments
- often I like to create a class which is basically a dictionary for the params I want to vary / save, and then save those to a dict (ex here), but this only works for relatively small projects (see the sketch after this list)
- trains by allegroai seems to be a promising experiment manager
- reddit thread detailing different tracking frameworks
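A minimal sketch of the params-as-a-class idea above; the specific hyperparameters and file name are just placeholders:

```python
import json

class Params:
    """Holds all hyperparameters for one run, so they can be saved next to results."""
    def __init__(self, lr=1e-3, batch_size=32, seed=0):
        self.lr = lr
        self.batch_size = batch_size
        self.seed = seed

p = Params(lr=3e-4)
with open("run_0_params.json", "w") as f:
    json.dump(vars(p), f, indent=2)  # vars(p) is just the instance's attribute dict
```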
## command line utils

- `ctrl-a`: HOME, `ctrl-e`: END
- `git add folder/\*.py` - escape the glob so git (not the shell) expands the pattern
## interpretability cheat-sheets
## presenting

- reveal-md - markdown-based slides
- manim - animated math visualizations
## vim shortcuts
- remap esc -> jk
- J - remove whitespace
- ctrl+r - redo
- o - open a new empty line below (and enter insert mode)
- O - open a new empty line above (and enter insert mode)
- e - move to end of word
- use h and l!
- r - replace command
- R - enter replace mode (deletes as it inserts)
- c - change operator (basically delete and then insert)
- use with position (ex. ce)
- W, B, gE, E - move by white-space separated words
- ctrl+o - go back, end search
- % - matching parenthesis
- can find and substitute (e.g. `:%s/old/new/g`)
- `:!ls` - run shell commands from within vim
## packaging projects
- good reference
```bash
python setup.py sdist bdist_wheel   # build source + wheel distributions into dist/
twine check dist/*                  # check the distributions for common problems
twine upload dist/*                 # upload to PyPI
```
## misc services
- magic wormhole: easily send files between computers
- google cloud storage buckets: just prefix a linux-style command with `gsutil` (e.g. `gsutil ls`, `gsutil du`)
## tmux shortcuts

- remap so clicking/scrolling works (see the snippet after this list)
- use `ctrl+b` (the prefix key) to do things
- `tmux ls` - list sessions
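One way to get clicking/scrolling working is enabling mouse mode in `~/.tmux.conf` (available in tmux ≥ 2.1):

```
# in ~/.tmux.conf: enable mouse support for clicking panes and scrolling
set -g mouse on
```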
## installation

- `pip install --user`
## sharing
- bookdown - write books in R markdown
- jupyter-book - write books with markdown + jupyter
## data
- cool analysis / data from BuzzFeed here
## js demo libraries
## reference
- this repo and this repo have really useful lists
- feel free to use/share this openly
- for similar projects, see some other repos (e.g. acd) or my collection of resources