Chandan Singh | Useful ml / data-science packages and tips

useful ml / data-science packages and tips

These are some packages/tips for machine-learning coding in python.

Machine learning code gets messy fast. In contrast to other tasks, machine learning often requires many variations of very similar models, which can be difficult to keep track of. Here are some tips on coding for machine learning, assuming you already know the basics (e.g. numpy, scikit-learn, etc.) and have selected an appropriate framework for your problem (e.g. pytorch):

very useful packages

  • tqdm: add a loading bar to any loop in a super easy way:
from tqdm import tqdm

for i in tqdm(range(10000)):
	...

displays

76%|████████████████████████████         | 7568/10000 [00:33<00:10, 229.00it/s]
  • h5py: a great way to read/write to arrays which are too big to store in memory, as if they were in memory
  • slurmpy: lets you submit jobs to slurm using just python, so you never need to write bash scripts.
  • pandas: provides dataframes for python. Often overlooked for big data, which might not fit into an in-memory DataFrame, but still very useful for comparing results of models, particularly with many hyperparameters.
    • modin - drop in pandas replacement to speed up operations
  • pandas-profiling - very quick data overviews
  • bamboolib - ui for pandas, in development
  • dovpanda - helpful hints for using pandas
  • imodels - fitting simple models
  • tabnine - autocomplete for jupyter
  • cloud9sdk can be a useful substitute for jupyterhub
  • thinc - interoperable dl framework
  • napari - image viewer

computation

  • dask - natively scales python
  • jax - high-performance python + numpy
  • numba - just-in-time compiles numerical python code; an alternative to dask for speeding things up that just requires adding decorators to functions
  • some tips for using jupyter
    • useful shortcuts: tab (autocomplete) and shift+tab (inspect something)

plotting

  • matplotlib - basic plotting in python
  • animatplot - animates plots in matplotlib
  • seaborn - makes quick and beautiful plots for easy data exploration, although may not be best for final plots
  • bokeh - interactive visualization library, examples
  • plotly - make interactive plots

deploying

  • streamlit - building interactive applications

general tips

  • installing things: pip/conda is generally the best way to install things. If you’re running into permission errors, pip install --user tends to fix a lot of common problems
  • make classes for datasets/dataloaders: wrapping data loading/preprocessing makes your code much cleaner and more modular. It also lets your models be adapted easily to different datasets. Pytorch has a good tutorial on how to do this (although the same principles apply without using pytorch).
  • store hyperparameters: when you test many different sets of hyperparameters, it is difficult to map which hyperparameters correspond to which results. It’s important to store hyperparameters in an easily readable way, such as saving an argparse object, or storing/saving parameters in a class you define yourself.
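
As a rough sketch of the last tip, the parsed argparse namespace can be dumped to json next to the results of each run. The flag names (--lr, --hidden_size, --out_dir), the helper names and the hparams.json filename below are made up for illustration, not taken from any particular project:

```python
import argparse
import json
import os
import tempfile


def get_parser():
    # hypothetical hyperparameters, purely for illustration
    parser = argparse.ArgumentParser()
    parser.add_argument('--lr', type=float, default=0.01)
    parser.add_argument('--hidden_size', type=int, default=128)
    parser.add_argument('--out_dir', type=str, default='results')
    return parser


def save_hparams(args, out_dir):
    """Write the parsed arguments to out_dir/hparams.json and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, 'hparams.json')
    with open(path, 'w') as f:
        # vars() turns the argparse Namespace into a plain dict
        json.dump(vars(args), f, indent=2, sort_keys=True)
    return path


def load_hparams(path):
    """Read the hyperparameters of a finished run back in as a dict."""
    with open(path) as f:
        return json.load(f)


# example: parse an explicit argument list and save it to a temporary folder
args = get_parser().parse_args(['--lr', '0.1'])
saved_path = save_hparams(args, tempfile.mkdtemp())
print(load_hparams(saved_path))
```

Keeping one such file per results folder means any result can later be matched to the exact settings that produced it.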

environment

  • it’s hard to pick a good ide for data science. jupyter notebooks are great for exploratory analysis, while more fully built ides like pycharm are better for large-scale projects
  • using atom with the hydrogen plugin often strikes a nice balance
  • jupytext offers a nice way to use version control with jupyter
  • pdoc3 can very quickly generate simple api documentation from docstrings
  • when working on AWS, this command is useful for starting remote jupyterlab sessions: screen jupyter lab --certfile=~/ssl/mycert.pem --keyfile ~/ssl/mykey.key

hyperparameter tracking

  • it can often be messy to keep track of ml experiments
  • I often like to create a class which is basically a dictionary of the params I want to vary / save, and then save those to a dict (ex here), but this only works for relatively small projects
  • trains by allegroai seems to be a promising experiment manager
  • reddit thread detailing different tracking frameworks
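
A minimal sketch of the "class which is basically a dictionary" idea from the list above, assuming a handful of made-up parameters (lr, num_layers, seed) and hypothetical method names; the linked example may well look different:

```python
import json


class Params:
    """Hypothetical container for the parameters of one experiment run."""

    def __init__(self, **overrides):
        # defaults for everything we might want to vary (illustrative values)
        self.lr = 0.01
        self.num_layers = 2
        self.seed = 0
        for key, val in overrides.items():
            # reject typos instead of silently creating a new parameter
            if not hasattr(self, key):
                raise KeyError(f'unknown parameter: {key}')
            setattr(self, key, val)

    def to_dict(self):
        # return a copy so callers cannot mutate the object's state by accident
        return dict(vars(self))

    def to_str(self):
        # short, stable identifier that can be used as a results filename
        return '_'.join(f'{k}={v}' for k, v in sorted(self.to_dict().items()))


p = Params(lr=0.1, seed=3)
results = {**p.to_dict(), 'test_acc': None}  # fill in metrics after the run
print(p.to_str())
print(json.dumps(results))
```

Saving one such dict per run gives a list of records that loads straight into a dataframe for comparison, which is usually enough until a project outgrows it.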

command line utils

  • ctrl-a: HOME, ctrl-e: END
  • git add folder/\*.py

sharing

  • gradio gives a nice web interface for getting model predictions

cheat-sheets

converting between dl frameworks

vim shortcuts

  • remap esc -> jk
  • J - join the next line onto the current one (removes the line break and extra whitespace)
  • ctrl+r - redo
  • o - new empty line below (and enter insert mode)
  • O - new empty line above (and enter insert mode)
  • e - move to end of word
  • use h and l!
  • r - replace the single character under the cursor
  • R - enter replace mode (deletes as it inserts)
  • c - change operator (basically delete and then insert)
    • use with position (ex. ce)
  • W, B, gE, E - move by white-space separated words
  • ctrl+o - go back, end search
  • % - matching parenthesis
    • can find and substitute
  • vim extensive
    • :! ls - run a shell command (here ls) without leaving vim

misc services

  • google buckets: just type gsutil followed by a linux-style command (e.g. gsutil ls, gsutil du)

tmux shortcuts

  • remap so clicking/scrolling works
  • ctrl+b is the default prefix key: press it before any tmux shortcut
  • tmux ls

installation

  • with pip, use --user to avoid permission errors

data

  • cool analysis / data from BuzzFeed here

scaling

reference