unsupervised
view markdownclustering
 labels are not given
 intracluster distances are minimized, intercluster distances are maximized
 distance measures
 symmetric D(A,B)=D(B,A)
 selfsimilarity D(A,A)=0
 positivity separation D(A,B)=0 iff A=B
 triangular inequality D(A,B) <$\leq$ D(A,C)+D(B,C)
 ex. Minkowski Metrics $d(x,y)=\sqrt[r]{\sum \vert x_iy_i\vert ^r}$
 r=1 Manhattan distance
 r=1 when y is binary > Hamming distance
 r=2 Euclidean
 r=$\infty$ “sup” distance
 correlation coefficient  unit independent
 edit distance
hierarchical
 two approaches:
 bottomup agglomerative clustering  starts with each object in separate cluster then joins
 topdown divisive  starts with 1 cluster then separates
 ex. starting with each item in its own cluster, find best pair to merge into a new cluster
 repeatedly do this to make a tree (dendrogram)
 distances between clusters defined by linkage function
 singlelink  closest members (long, skinny clusters)
 completelink  furthest members (tight clusters)
 average  most widely used
 ex. MST  keep linking shortest link
 ultrametric distance  tighter than triangle inequality
 $d(x, y) \leq \max[d(x,z), d(y,z)]$
partitional
 partition n objects into a set of K clusters (must be specified)
 globally optimal: exhaustively enumerate all partitions
 minimize sum of squared distances from cluster centroid
 evaluation w/ labels  purity  ratio between dominant class in cluster and size of cluster
 kmeans++  better at not getting stuck in local minima
 randomly move centers apart
 Complexity: $O(n^2p)$ for first iteration and then can only get worse
statistical clustering (j 10)
 latent vars  values not specified in the observed data
 KMeans
 start with random centers
 E: assign everything to nearest center: $O(\text{clusters}*np) $
 M: recompute centers $O(np)$ and repeat until nothing changes
 partition amounts to Voronoi diagram

can be viewed as minimizing distortion measure $J=\sum_n \sum_i z_n^i x_n  \mu_i ^2$

GMMs: $p(x \theta) = \underset{i}{\Sigma} \pi_i \mathcal{N}(x \mu_i, \Sigma_i)$ 
$l(\theta x) = \sum_n \log : p(x_n \theta) \ = \sum_n \log \sum_i \pi_i \mathcal{N}(x_n \mu_i, \Sigma_i)$ 
hard to maximize bcause log acts on a sum
 “soft” version of Kmeans  update means as weighted sums of data instead of just normal mean
 sometimes initialize Kmeans w/ GMMs

conditional mixture models  regression/classification (j 10)
graph LR;
X>Y;
X > Z
Z > Y
 ex.
 latent variable Z has multinomial distr.

mixing proportions: $P(Z^i=1 x, \xi)$  ex. $ \frac{e^{\xi_i^Tx}}{\sum_je^{\xi_j^Tx}}$

mixture components: $p(y Z^i=1, x, \theta_i)$ ~ different choices  ex. mixture of linear regressions

$p(y x, \theta) = \sum_i \underbrace{\pi_i (x, \xi)}_{\text{mixing prop.}} \cdot \underbrace{\mathcal{N}(y \beta_i^Tx, \sigma_i^2)}_{\text{mixture comp.}}$

 ex. mixtures of logistic regressions

$p(y x, \theta_i) = \underbrace{\pi_i (x, \xi)}{\text{mixing prop.}} \cdot \underbrace{\mu(\theta_i^Tx)^y\cdot[1\mu(\theta_i^Tx)]^{1y}}{\text{mixture comp.}}$ where $\mu$ is the logistic function


 also, nonlinear optimization for this (including EM)
dim reduction
In general there is some tension between preserving global properties (e.g. PCA) and local peroperties (e.g. nearest neighborhoods)
Method  Analysis objective  Temporal smoothing  Explicit noise model  Notes 

PCA  Covariance  No  No  orthogonality 
FA  Covariance  No  Yes  like PCA, but with errors (not biased by variance) 
LDS/GPFA  Dynamics  Yes  Yes  
NLDS  Dynamics  Yes  Yes  
LDA  Classification  No  No  
Demixed  Regression  No  Yes/No  
Isomap/LLE  Manifold discovery  No  No  
TSNE  ….  ….  …  
UMAP  …  …  … 
 NMF  $\min_{D \geq 0, A \geq 0} XDA_F^2$
 SEQNMF
 LDA/QDA  finds basis that separates classes
 reduced to axes which separate classes (perpendicular to the boundaries)
 Kmeans  can be viewed as a linear decomposition
spectral clustering
 spectral clustering  does dim reduction on eigenvalues (spectrum) of similarity matrix before clustering in few dims
 uses adjacency matrix
 basically like PCA then kmeans
 performs better with regularization  add small constant to the adjacency matrix
pca
 want new set of axes (linearly combine original axes) in the direction of greatest variability
 this is best for visualization, reduction, classification, noise reduction
 assume $X$ (nxp) has zero mean

derivation:
 minimize variance of X projection onto a unit vector v
 $\frac{1}{n} \sum (x_i^Tv)^2 = \frac{1}{n}v^TX^TXv$ subject to $v^T v=1$
 $\implies v^T(X^TXv\lambda v)=0$: solution is achieved when $v$ is eigenvector corresponding to largest eigenvalue
 like minimizing perpendicular distance between data points and subspace onto which we project
 minimize variance of X projection onto a unit vector v

SVD: let $U D V^T = SVD(Cov(X))$
 $Cov(X) = \frac{1}{n}X^TX$, where X has been demeaned
 equivalently, eigenvalue decomposition of covariance matrix $\Sigma = X^TX$

each eigenvalue represents prop. of explained variance: $\sum \lambda_i = tr(\Sigma) = \sum Var(X_i)$

screeplot  eigenvalues in decreasing order, look for num dims with kink
 don’t automatically center/normalize, especially for positive data

 SVD is easier to solve than eigenvalue decomposition, can also solve other ways
 multidimensional scaling (MDS)
 based on eigenvalue decomposition
 adaptive PCA
 extract components sequentially, starting with highest variance so you don’t have to extract them all
 multidimensional scaling (MDS)
 good PCA code: http://cs231n.github.io/neuralnetworks2/
X = np.mean(X, axis = 0) # zerocenter data (nxd) cov = np.dot(X.T, X) / X.shape[0] # get cov. matrix (dxd) U, D, V = np.linalg.svd(cov) # compute svd, (all dxd) Xrot_reduced = np.dot(X, U[:, :2]) # project onto first 2 dimensions (n x 2)
 nonlinear pca
 usually uses an autoassociative neural network
topic modeling

similar, try to discover topics in a model (which maybe can be linearly combined to produce the original document)

ex. LDA  generative model: posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics
sparse coding = sparse dictionary learning
[\underset {\mathbf{D}} \min \underset t \sum \underset {\mathbf{a^{(t)}}} \min  \mathbf{x^{(t)}}  \mathbf{Da^{(t)}}  _2^2 + \lambda  \mathbf{a^{(t)}}  _1] 
 D is like autoencoder output weight matrix
 $a$ is more complicated  requires solving inner minimization problem
 outer loop is not quite lasso  weights are not what is penalized
 impose norm $D$ not too big
 algorithms
 thresholding (simplest)  do $D^Ty$ and then threshold this
 basis pursuit  change $l_0$ to $l_1$
 this will work under certain conditions (with theoretical guarantees)
 matching purusuit  greedy, find support one at a time, then look for the next one
ica
 overview
 remove correlations and higher order dependence
 all components are equally important
 like PCA, but instead of the dot product between components being 0, the mutual info between components is 0
 goals
 minimize statistical dependence between components
 maximize information transferred in a network of nonlinear units
 uses information theoretic unsupervised learning rules for neural networks
 problem  doesn’t rank features for us
 goal: want to decompose $X$ into $z$, where we assume $X = Az$
 assumptions
 independence: $P(z) = \prod_i P(z_i)$
 nongaussianity of $z$
 2 ways to get $z$ that matches these assumptions
 maximize nongaussianity of $z$  use kurtosis, negentropy
 minimize mutual info between components of $z$  use KL, max entropy
 often equivalent
 identifiability: $z$ is identifiable up to a permutation and scaling of sources when
 at most one of the sources $z_k$ is gaussian
 $A$ is fullrank
 assumptions
 ICA learns components which are completely independent, whereas PCA learns orthogonal components
 nonlinear ica: $X \approx f(z)$, where assumptions on $z$ are the same, but $f$ can be nonlinear
 to obtain identifiability, we need to restrict $f$ and/or constrain the distr of the sources $z$
 bell & sejnowski 1995 original formulation (slightly different)
 entropy maximization  try to find a nonlinear function $g(x)$ which lets you map that distr $f(x)$ to uniform
 then, that function $g(x)$ is the cdf of $f(x)$
 in ICA, we do this for higher dims  want to map distr of $x_1, …, x_p$ to $y_1, …, y_p$ where distr over $y_i$’s is uniform (implying that they are independent)
 additionally we want the map to be information preserving

mathematically: $\underset{W} \max I(x; y) = \underset{W} \max H(y)$ since $H(y x)$ is zero (there is no randomness)  assume $y = \sigma (W x)$ where $\sigma$ is elementwise
 (then S = WX, $W=A^{1}$)

requires certain assumptions so that $p(y)$ is still a distr: $p(y) = p(x) / J $ where J is Jacobian

learn W via gradient ascent $\Delta W \propto \partial / \partial W (\log J )$  there is now something faster called fast ICA
 entropy maximization  try to find a nonlinear function $g(x)$ which lets you map that distr $f(x)$ to uniform
 topographic ICA (make nearby coefficient like each other)
 interestingly, some types of selfsupervised learning perform ICA assuming certain data structure (e.g. timecontrastive learning (hyvarinen et al. 2016))
topological
 multidimensional scaling (MDS)
 given a a distance matrix, MDS tries to recover lowdim coordinates s.t. distances are preserved
 minimizes goodnessoffit measure called stress = $\sqrt{\sum (d_{ij}  \hat{d}{ij})^2 / \sum d{ij}^2}$
 visualize in low dims the similarity between individial points in highdim dataset
 classical MDS assumes Euclidean distances and uses eigenvalues
 constructing configuration of n points using distances between n objects
 uses distance matrix
 $d_{rr} = 0$
 $d_{rs} \geq 0$
 solns are invariant to translation, rotation, relfection
 solutions types
 nonmetric methods  use rank orders of distances
 invariant to uniform expansion / contraction
 metric methods  use values
 nonmetric methods  use rank orders of distances
 D is Euclidean if there exists points s.t. D gives interpoint Euclidean distances
 define B = HAH
 D Euclidean iff B is psd
 define B = HAH
 tsne preserves pairwise neighbors
 tsne tutorial
 tsne tries to match pairwise distances between the original data and the latent space data:
 original data
 distances are converted to probabilities by assuming points are means of Gaussians, then normalizing over all pairs
 variance of each Gaussian is scaled depending on the desired perplexity
 distances are converted to probabilities by assuming points are means of Gaussians, then normalizing over all pairs
 latent data
 distances are calculated using some kernel function
 tSNE uses heavytailed Student’s tdistr kernel (van der Maaten & Hinton, 2008)
 SNE use Gausian kernel (Hinton & Roweis, 2003)
 kernels have some parameters that can be picked or learned
 perplexity  how to balance between local/global aspects of data
 distances are calculated using some kernel function
 optimization  for optimization purposes, this can be decomposed into attractive/repulsive forces
 umap: Uniform Manifold Approximation and Projection for Dimension Reduction
 pacmap
misc
 NNKMeans: Dictionary Learning using NonNegative Kernel regression (shekkizhar & ortega, 2021)
 data summarization  represent large datasets by a small set of elements (e.g. kmeans)
 here, use dictionary learning instead of kmeans to summarize data
 each dictionary element is a sparse combination of inputs
 use nonnegative kernel regesion (NNK) to measure distances when designing the dictionary (shekkizar & ortega, 2020)
generative models
 overview: https://blog.openai.com/generativemodels/
 notes for deep unsupervised learning
 MLE equivalent to minimizing KL for density estimation:

$\min_\theta KL(p p_\theta) =\ \min_\thetaH(p) + \mathbb E_{x\sim p}[\log p_\theta(x)] \ \max_\theta E_p[\log p_\theta(x)]$

autoregressive models
 model input based on input
 $p(x_1)$ is a histogram (learned prior)

$p(x_2 x_1)$ is a distr. ouptut by a neural net (output is logits, followed by softmax)  all conditional distrs. can be given by neural net

can model using an RNN: e.g. charrnn (karpathy, 2015): $\log p(x)  \sum_i \log p(x_i x_{1:i1})$, where each $x_i$ is a character  can also use masks
 masked autoencoder for distr. estimation  mask some weights so that autoencoder output is a factorized distr.
 pick an odering for the pixels to be conditioned on
 ex. 1d masked convolution on wavenet (use past points to predict future points)
 ex. pixelcnn  use masking for pixels to the topleft
 ex. gated pixelcnn  fixes issue with blindspot
 ex. pixelcnn++  nearby pixel values are likely to cooccur
 ex. pixelSNAIL  uses attention and can get wider receptive field
 attention:$A(q, K, V) = \sum_i \frac{\exp(q \cdot k_i)}{\sum_j \exp (q \cdot k_j)} v_i$
 masked attention can be more flexible than masked convolution
 can do super resolution, hierarchical autoregressive model
 masked autoencoder for distr. estimation  mask some weights so that autoencoder output is a factorized distr.
 problems
 slow  have to sample each pixel (can speed up by caching activations)
 can also speed up by assuming some pixels conditionally independent
 slow  have to sample each pixel (can speed up by caching activations)
 hard to get a latent reprsentation
 can use Fisher score $\nabla_\theta \log p_\theta (x)$
flow models
 good intro to implementing invertible neural networks: https://hci.iwr.uniheidelberg.de/vislearn/inverseproblemsinvertibleneuralnetworks/
 input / output dimension need to have same dimension
 we can get around this by padding one of the dimensions with noise variables (and we might want to penalize these slightly during training)
 normalizing flows
 ultimate goal: a likelihoodbased model with
 fast sampling
 fast inference (evaluating the likelihood)
 fast training
 good samples
 good compression
 transform some $p(x)$ to some $p(z)$
 $x \to z = f_\theta (x)$, where $z \sim p_Z(z)$
 $p_\theta (x) dx = p(z)dz$

$p_\theta(x) = p(f_\theta(x)) \frac {\partial f_\theta (x)}{\partial x} $
 autoregressive flows
 map $x\to z$ invertible
 $x \to z$ is same as loglikelihood computation
 $z\to x$ is like sampling
 end up being as deep as the number of variables
 map $x\to z$ invertible
 realnvp (dinh et al. 2017)  can couple layers to preserve invertibility but still be tractable
 downsample things and have different latents at different spatial scales
 other flows
 flow++
 glow
 FFJORD  continuous time flows
 discrete data can be harder to model
 dequantization  add noise (uniform) to discrete data
vaes
 intuitively understanding vae
 VAE tutorial

minimize $\mathbb E_{q_\phi(z x)}[\log p_\theta(x z) D_{KL}(q_\phi(z x): :p(z))]$  want latent $z$ to be standard normal  keeps the space smooth

hard to directly calculate $p(z x)$, since it includes $p(x)$, so we approximate it with the variational posterior $q_\phi (z x)$, which we assume to be Gaussian 
goal: $\text{argmin}\phi KL(q\phi(z x) : :p(z x))$ 
still don’t have acess to $p(x)$, so rewrite $\log p(x) = ELBO(\phi) + KL(q_\phi(z x) : : p(z x))$ 
instead of minimizing $KL$, we can just maximize the $ELBO=\mathbb E_q [\log p(x, z)]  \mathbb E_q[\log q_\phi (z x)]$

 meanfield variational inference  each point has its own params (e.g. different encoder DNN) vs amortized inference  same encoder for all points

 pyro explanation
 want large evidence $\log p_\theta (\mathbf x)$ (means model is a good fit to the data)

want good fit to the posterior $q_\phi(z x)$
 just an autoencoder where the middle hidden layer is supposed to be unit gaussian
 add a kl loss to measure how well it matches a unit gaussian
 for calculation purposes, encoder actually produces means / vars of gaussians in hidden layer rather than the continuous values….
 this kl loss is not too complicated…https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf
 add a kl loss to measure how well it matches a unit gaussian
 generally less sharp than GANs
 uses mse loss instead of gan loss…
 intuition: vaes put mass between modes while GANs push mass towards modes
 constraint forces the encoder to be very efficient, creating informationrich latent variables. This improves generalization, so latent variables that we either randomly generated, or we got from encoding nontraining images, will produce a nicer result when decoded.
gans
 evaluating gans
 don’t have explicit objective like likelihood anymore
 kernel density = parzenwindow density based on samples yields likelihood

inception score $IS(\mathbf x) = \exp(\underbrace{H(\mathbf y)}_{\text{want to generate diversity of classes}}  \underbrace{H(\mathbf y \mathbf x)}_{\text{each image should be distinctly recognizable}})$  FID  Frechet inception score works directly on embedded features from inception v3 model
 embed population of images and calculate mean + variance in embedding space
 measure distance between these means / variances for real/synthetic images using Frechet distance = Wasseterstein2 distance
 infogan
 problems
 mode collapse  pick just one mode in the distr.

train network to be loss function

original gan paper (2014)

generative adversarial network
 goal: want G to generate distribution that follows data
 ex. generate good images
 two models
 G  generative
 D  discriminative
 G generates adversarial sample x for D
 G has prior z
 D gives probability p that x comes from data, not G
 like a binary classifier: 1 if from data, 0 from G
 adversarial sample  from G, but tricks D to predicting 1
 training goals
 G wants D(G(z)) = 1
 D wants D(G(z)) = 0
 D(x) = 1
 converge when D(G(z)) = 1/2
 G loss function: $G = \text{argmin}_G \log(1D(G(Z))$
 overall $\min_g \max_D$ log(1D(G(Z))
 training algorithm
 in the beginning, since G is bad, only train my minimizing G loss function

projecting into gan latent space (=gan inversion)
 2 general approaches
 learn an encoder to go image > latent space
 InDomain GAN Inversion for Real Image Editing (zhu et al. 2020)
 learn encoder to project image into latent space, with regularizer to make sure it follows the right distr.
 learn an encoder to go image > latent space
 optimize latent code wrt image directly
 can also learn an encoder to initialize this optimization
 some work designing GANs that are intrinsically invertible
 styleganspecific  some works which exploit layerwise noises
 stylegan2 paper: optimize w along with noise maps  need to make sure noise maps don’t include signal
 2 general approaches
diffusion / energybased models
 first describe a procedure for gradually turning data into noise
 then train a DNN to invert this procedure stepbystep
 single model handles many different noise levels with shared parameters
 blog post
 seminal paper: Generative Modeling by Estimating Gradients of the Data Distribution (song & ermon, 2019)
 really started earlier: Deep Unsupervised Learning using Nonequilibrium Thermodynamics (sohldickstein, …, ganguli, 2015)
 Improved Denoising Diffusion Probabilistic Models (2021)
 can make this classconditionnal by incorporating classifier into the model which inverts the noise
 Diffusion Models Beat GANs on Image Synthesis (2021)
self/semisupervised
selfsupervised
 blog: https://dawn.cs.stanford.edu/2017/07/16/weaksupervision/
 training data is hard to get
 related paper: https://www.biorxiv.org/content/early/2018/06/16/339630
 basics: predict some part of the input (e.g. present from past, bottom from top, etc.)
 ex. denoising autoencoder
 ex. inpainting (can use adversarial loss)
 ex. colorization, splitbrain autoencoder
 colorization in video given first frame (helps learn tracking)
 ex. relative patch prediction
 ex. orientation prediction
 ex. nlp
 word2vec
 bert  predict blank word
 contrastive predictive coding  translates generative modeling into classification
 contrastive loss = InfoNCE loss uses crossentropy loss to measure how well the model can classify the “future” representation among a set of unrelated “negative” samples
 negative samples may be from other batches or other parts of the input
 momentum contrast  queue of previous embeddings are “keys”
 match new embedding (query) against keys and use contrastive loss
 similar idea as memory bank
 Unsupervised Visual Representation Learning by Context Prediction (efros 15)
 predict relative location of different patches
 SimCLR (Chen et al, 2020)
 maximize agreement for different points after some augmentation (contrastive loss)
semisupervised
 make the classifier more confident
 entropy minimization  try to minimize the entropy of output predictions (like making confident predictions labels)
 pseudo labeling  just take argmax pred as if it were the label
 label consistency with data augmentation
 Billionscale semisupervised learning for image classification (FAIR, 19)
 unsupervised learning + model distillation succeeds on imagenet
 ensembling
 temporal ensembling  ensemble multiple models at different training epochs
 mean teachers  learn from exponential moving average of students
 unsupervised data augmentation  augment and ensure prediction is the same
 distribution alignment  ex. cyclegan  enforce cycle consistency = dual learning = back translation
 simpler is marginal matching
compression
 simplest  fixedlength code
 variablelength code
 could append stop char to each codeword
 general prefixfree code = binary tries
 codeword is path from froot to leaf
 huffman code  higher prob = shorter
 arithmetic coding
 motivation: coding one symbol at a time incurs penalty of +1 per symbol  more efficient to encode groups of things
 can be improved with good autoregressive model
contrastive learning
 What makes for good views for contrastive learning (tian et al. 2020)
 how to select views (e.g. transformations we want to be invariant to)?
 reduce the mutual information (MI) between views while keeping taskrelevant information intact
 Supervised Contrastive Learning (khosla et al. 2020)
 DataEfficient Image Recognition with Contrastive Predictive Coding
 pretraining with CPC on ImageNet improves accuracy
 Automatically Discovering and Learning New Visual Categories with Ranking Statistics