| Section | Topic |
|---|---|
| general | intro, linear algebra, gaussian, parameter estimation, bias-variance |
| regression | lin reg, LS, kernels, sparsity |
| dim reduction | dim reduction |
| classification | discr. vs. generative, nearest neighbor, DNNs, log. regression, lda/qda, decision trees, svms |
| optimization | problems, algorithms, duality, boosting, em |

equivalent to the triangle inequality




Hessian of a function $f : \mathbb{R}^n \to \mathbb{R}$ (an $n \times n$ matrix):

$$\nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
$$x^T A x = \operatorname{tr}(x x^T A) = \sum_{i,j} x_i A_{i,j} x_j$$
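A quick numpy sanity check of this identity (the matrix and vector here are arbitrary examples):

```python
import numpy as np

# verify x^T A x = tr(x x^T A) on a random example
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)
lhs = x @ A @ x
rhs = np.trace(np.outer(x, x) @ A)
print(np.isclose(lhs, rhs))  # True
```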

n = number of data points
d = dimension of each data point

| Model | Loss |
|---|---|
| OLS | $\lVert Xw - y\rVert_2^2$ |
| Ridge | $\lVert Xw - y\rVert_2^2 + \lambda\lVert w\rVert_2^2$ |
| Lasso | $\lVert Xw - y\rVert_2^2 + \lambda\lVert w\rVert_1$ |
| Elastic Net | $\lVert Xw - y\rVert_2^2 + \lambda\big(\alpha\lVert w\rVert_1 + (1-\alpha)\lVert w\rVert_2^2\big)$ |

($\lambda$ = regularization strength, $\alpha$ = mix between the $\ell_1$ and $\ell_2$ penalties.)
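As a usage sketch, these objectives map onto scikit-learn estimators (assuming scikit-learn is available; note that its objectives scale the data-fit and penalty terms slightly differently than written above, so `alpha` is only roughly the $\lambda$ in the table):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),                       # alpha plays the role of lambda
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, m in models.items():
    m.fit(X, y)
    print(name, np.round(m.coef_, 2))                # Lasso / Elastic Net zero out some coefficients
```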
$\hat{\theta}_{\text{Bayes}} = \mathbb{E}_{\theta \sim p(\theta \mid x)}[\theta]$ (the posterior mean)
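For a concrete worked example: with a $\mathrm{Beta}(a, b)$ prior on a coin's heads probability $\theta$ and $k$ heads observed in $n$ flips, the posterior is $\mathrm{Beta}(a + k,\; b + n - k)$, so

$$\hat{\theta}_{\text{Bayes}} = \frac{a + k}{a + b + n}.$$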








adding i.i.d. Gaussian noise to $x$ and $y$ acts as a form of regularization
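One way to see this for the input noise in linear regression: if $E$ has i.i.d. $\mathcal{N}(0, \sigma^2)$ entries, the cross term vanishes in expectation and

$$\mathbb{E}_E\big[\|(X + E)w - y\|_2^2\big] = \|Xw - y\|_2^2 + n\sigma^2\|w\|_2^2,$$

which is exactly the ridge objective with $\lambda = n\sigma^2$.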

PCA: find orthogonal directions that maximize the variance of the projected data

```python
import numpy as np

X -= np.mean(X, axis=0)              # zero-center the data (n x d)
cov = np.dot(X.T, X) / X.shape[0]    # covariance matrix (d x d)
U, D, V = np.linalg.svd(cov)         # SVD of the covariance matrix (all d x d)
X_2d = np.dot(X, U[:, :2])           # project onto the top 2 principal components (n x 2)
```
invariant to scalings / affine transformations of X, Y




$$\hat{y}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{j=1}^{n} K_h(x - x_j)}$$
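A minimal numpy sketch of this Nadaraya-Watson estimator with a Gaussian kernel (the data, bandwidth, and the helper name `kernel_regression` are illustrative):

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, h=0.5):
    """Nadaraya-Watson estimate at each query point, Gaussian kernel with bandwidth h."""
    K = np.exp(-(x_query[:, None] - x_train[None, :])**2 / (2 * h**2))  # (m, n) kernel weights
    return (K @ y_train) / K.sum(axis=1)                                # weighted average of y

x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=50)
print(kernel_regression(np.array([0.0, np.pi / 2]), x, y))
```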

Hessian


M-smooth = Lipschitz continuous gradient:

| Lipschitz continuous $f$ | M-smooth ($M$-Lipschitz gradient) |
|---|---|
| $\lvert f(x) - f(y)\rvert \le M\lVert x - y\rVert$ | $\lVert\nabla f(x) - \nabla f(y)\rVert \le M\lVert x - y\rVert$ |
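For instance, a quadratic $f(x) = \tfrac{1}{2}x^T A x$ with symmetric positive semidefinite $A$ has $\nabla f(x) = Ax$, so $\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\| \le \lambda_{\max}(A)\,\|x - y\|$: it is M-smooth with $M = \lambda_{\max}(A)$.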




also see nn demo playground

```python
from numpy import exp, array, random

# toy data: 4 examples with 3 features; the target equals the first feature
X = array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
Y = array([[0, 1, 1, 0]]).T
w = 2 * random.random((3, 1)) - 1                 # random init in [-1, 1)
for iteration in range(10000):
    Yhat = 1 / (1 + exp(-(X @ w)))                # sigmoid forward pass
    w += X.T @ ((Y - Yhat) * Yhat * (1 - Yhat))   # gradient step (squared-error loss)
print(1 / (1 + exp(-(array([1, 0, 0]) @ w))))     # prediction for a new example
```
For larger networks, use a framework: `import tensorflow as tf` or `import torch`.
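As a sketch, the same one-layer sigmoid network written with PyTorch autograd (learning rate and iteration count are arbitrary choices here):

```python
import torch

# same toy problem as the numpy version, but with autograd
X = torch.tensor([[0., 0., 1.], [1., 1., 1.], [1., 0., 1.], [0., 1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])
w = torch.randn(3, 1, requires_grad=True)
opt = torch.optim.SGD([w], lr=1.0)
for _ in range(10000):
    Yhat = torch.sigmoid(X @ w)
    loss = ((Y - Yhat) ** 2).mean()     # squared-error loss, as in the numpy version
    opt.zero_grad()
    loss.backward()
    opt.step()
print(torch.sigmoid(torch.tensor([1., 0., 0.]) @ w).item())   # prediction for a new example
```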

















EM: we want to maximize the complete-data log-likelihood, but the latent variables are unobserved, so we iteratively maximize its expected value under the current posterior over the latents (E-step: compute the expectation; M-step: maximize over the parameters)
note 20 is good
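As an illustration of the E/M alternation, a minimal sketch of EM for a two-component 1-D Gaussian mixture (synthetic data; initialization and iteration count chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

w1, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])   # mixing weight of comp. 1, means, variances
for _ in range(100):
    # E-step: responsibility of component 1 for each point
    p0 = (1 - w1) * np.exp(-(x - mu[0])**2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
    p1 = w1 * np.exp(-(x - mu[1])**2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
    r = p1 / (p0 + p1)
    # M-step: re-estimate parameters from the responsibilities
    w1 = r.mean()
    mu = np.array([((1 - r) * x).sum() / (1 - r).sum(),
                   (r * x).sum() / r.sum()])
    var = np.array([((1 - r) * (x - mu[0])**2).sum() / (1 - r).sum(),
                    (r * (x - mu[1])**2).sum() / r.sum()])
print(w1.round(2), mu.round(2), var.round(2))
```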


decision boundary: $\{x : w^T x + b = 0\}$


can rewrite by absorbing the bias $b$ into $w$ (append a constant feature $1$ to each $x$), so the boundary becomes $\{x : w^T x = 0\}$

With labels $y \in \{-1, +1\}$ and margin $y\, w^T x$:

| Model | Loss |
|---|---|
| Perceptron | $\max(0,\; -y\, w^T x)$ |
| Linear SVM | $\max(0,\; 1 - y\, w^T x)$ (hinge), usually plus $\lambda\lVert w\rVert_2^2$ |
| Logistic regression | $\log\!\left(1 + e^{-y\, w^T x}\right)$ |
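A small sketch comparing these surrogate losses as functions of the margin $z = y\, w^T x$ (the grid of margin values is just for illustration):

```python
import numpy as np

z = np.linspace(-2, 2, 5)                     # margin y * w^T x
perceptron = np.maximum(0, -z)                # penalize only misclassified points
hinge = np.maximum(0, 1 - z)                  # SVM: also penalize small positive margins
logistic = np.log(1 + np.exp(-z))             # smooth surrogate
print(np.column_stack([z, perceptron, hinge, logistic]).round(2))
```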
primal:

$$p^* = \min_x\; f_0(x) \quad \text{s.t.}\quad f_i(x) \le 0,\; h_i(x) = 0$$

dual:

$$d^* = \max_{\lambda \succeq 0,\, \nu}\; \underbrace{\inf_x\; \underbrace{f_0(x) + \textstyle\sum_i \lambda_i f_i(x) + \sum_i \nu_i h_i(x)}_{\text{Lagrangian } L(x,\lambda,\nu)}}_{\text{dual function } g(\lambda,\nu)}$$
weak duality: $d^* \le p^*$ (always holds)
strong duality: $d^* = p^*$ (holds for convex problems under constraint qualifications such as Slater's condition)
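A tiny worked example of these definitions: for $\min_x x^2$ s.t. $1 - x \le 0$,

$$L(x, \lambda) = x^2 + \lambda(1 - x), \qquad g(\lambda) = \inf_x L(x, \lambda) = \lambda - \tfrac{\lambda^2}{4},$$

which is maximized at $\lambda = 2$, giving $d^* = 1 = p^*$ (strong duality holds: the problem is convex and strictly feasible).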

lasso (constrained form):

$$\min_w \|Xw - y\|_2^2 \quad \text{s.t.}\quad \|w\|_1 \le k$$

ridge (constrained form):

$$\min_w \|Xw - y\|_2^2 \quad \text{s.t.}\quad \|w\|_2 \le k$$


choose the split that maximizes the information gain: $H(\text{parent})$ minus the weighted average of the children's entropies
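A small sketch of this criterion for a binary split (the labels, split, and helper names `entropy` / `information_gain` are illustrative):

```python
import numpy as np

def entropy(y):
    """Shannon entropy (bits) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y_parent, y_left, y_right):
    """H(parent) minus the size-weighted average entropy of the children."""
    n = len(y_parent)
    weighted = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)
    return entropy(y_parent) - weighted

y = np.array([0, 0, 1, 1, 1, 0, 1, 1])
print(information_gain(y, y[:4], y[4:]))
```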




boosting: sequentially train many weak learners, each fit to correct the errors of the current ensemble, and combine them to approximate a function
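For concreteness, a minimal AdaBoost-style sketch with decision stumps as the weak learners (assumes labels in $\{-1, +1\}$; the helper names `fit_stump`, `adaboost`, and `predict` are just for illustration):

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, sign) stump with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - t + 1e-12)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, rounds=20):
    n = len(y)
    w = np.full(n, 1 / n)                                 # example weights
    stumps = []
    for _ in range(rounds):
        err, j, t, s = fit_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # weight of this weak learner
        pred = s * np.sign(X[:, j] - t + 1e-12)
        w *= np.exp(-alpha * y * pred)                    # upweight the mistakes
        w /= w.sum()
        stumps.append((alpha, j, t, s))
    return stumps

def predict(stumps, X):
    score = sum(a * s * np.sign(X[:, j] - t + 1e-12) for a, j, t, s in stumps)
    return np.sign(score)
```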