| Section | Topic |
|---|---|
| general | intro, linear algebra, gaussian, parameter estimation, bias-variance |
| regression | lin reg, LS, kernels, sparsity |
| dim reduction | dim reduction |
| classification | discr. vs. generative, nearest neighbor, DNNs, log. regression, lda/qda, decision trees, svms |
| optimization | problems, algorithms, duality, boosting, em |

equivalent to the triangle inequality




Hessian of a function $f : \mathbb{R}^n \to \mathbb{R}$ (an $n \times n$ matrix):

$$\nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
$$x^T A x = \operatorname{tr}(x x^T A) = \sum_{i,j} x_i A_{i,j} x_j$$
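A quick numpy sanity check of this identity (the matrix and vector here are arbitrary examples):

```python
import numpy as np

# verify x^T A x = tr(x x^T A) on a random example
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)
lhs = x @ A @ x
rhs = np.trace(np.outer(x, x) @ A)
print(np.isclose(lhs, rhs))  # True
```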

n = number of data points
d = dimension of each data point

| Model | Loss |
|---|---|
| OLS | $\lVert Xw - y\rVert_2^2$ |
| Ridge | $\lVert Xw - y\rVert_2^2 + \lambda\lVert w\rVert_2^2$ |
| Lasso | $\lVert Xw - y\rVert_2^2 + \lambda\lVert w\rVert_1$ |
| Elastic Net | $\lVert Xw - y\rVert_2^2 + \lambda\big(\alpha\lVert w\rVert_1 + (1-\alpha)\lVert w\rVert_2^2\big)$ |

($\lambda$ = regularization strength, $\alpha$ = mix between the $\ell_1$ and $\ell_2$ penalties.)
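As a usage sketch, these objectives map onto scikit-learn estimators (assuming scikit-learn is available; note that its objectives scale the data-fit and penalty terms slightly differently than written above, so `alpha` is only roughly the $\lambda$ in the table):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),                       # alpha plays the role of lambda
    "Lasso": Lasso(alpha=0.1),
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, m in models.items():
    m.fit(X, y)
    print(name, np.round(m.coef_, 2))                # Lasso / Elastic Net zero out some coefficients
```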
$\hat{\theta}_{\text{Bayes}} = \mathbb{E}_{\theta \sim p(\theta \mid x)}[\theta]$ (the posterior mean)
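For a concrete worked example: with a $\mathrm{Beta}(a, b)$ prior on a coin's heads probability $\theta$ and $k$ heads observed in $n$ flips, the posterior is $\mathrm{Beta}(a + k,\; b + n - k)$, so

$$\hat{\theta}_{\text{Bayes}} = \frac{a + k}{a + b + n}.$$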








adding i.i.d. Gaussian noise to $x$ and $y$ acts as a form of regularization
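One way to see this for the input noise in linear regression: if $E$ has i.i.d. $\mathcal{N}(0, \sigma^2)$ entries, the cross term vanishes in expectation and

$$\mathbb{E}_E\big[\|(X + E)w - y\|_2^2\big] = \|Xw - y\|_2^2 + n\sigma^2\|w\|_2^2,$$

which is exactly the ridge objective with $\lambda = n\sigma^2$.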

PCA: find orthogonal directions that maximize the variance of the projected data

```python
import numpy as np

X -= np.mean(X, axis=0)              # zero-center the data (n x d)
cov = np.dot(X.T, X) / X.shape[0]    # covariance matrix (d x d)
U, D, V = np.linalg.svd(cov)         # SVD of the covariance matrix (all d x d)
X_2d = np.dot(X, U[:, :2])           # project onto the top 2 principal components (n x 2)
```
invariant to scalings / affine transformations of X, Y




$$\hat{y}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{j=1}^{n} K_h(x - x_j)}$$
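A minimal numpy sketch of this Nadaraya-Watson estimator with a Gaussian kernel (the data, bandwidth, and the helper name `kernel_regression` are illustrative):

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, h=0.5):
    """Nadaraya-Watson estimate at each query point, Gaussian kernel with bandwidth h."""
    K = np.exp(-(x_query[:, None] - x_train[None, :])**2 / (2 * h**2))  # (m, n) kernel weights
    return (K @ y_train) / K.sum(axis=1)                                # weighted average of y

x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + 0.1 * np.random.default_rng(0).normal(size=50)
print(kernel_regression(np.array([0.0, np.pi / 2]), x, y))
```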

Hessian


M-smooth = Lipschitz continuous gradient:

| Lipschitz continuous $f$ | M-smooth ($M$-Lipschitz gradient) |
|---|---|
| $\lvert f(x) - f(y)\rvert \le M\lVert x - y\rVert$ | $\lVert\nabla f(x) - \nabla f(y)\rVert \le M\lVert x - y\rVert$ |
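For instance, a quadratic $f(x) = \tfrac{1}{2}x^T A x$ with symmetric positive semidefinite $A$ has $\nabla f(x) = Ax$, so $\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\| \le \lambda_{\max}(A)\,\|x - y\|$: it is M-smooth with $M = \lambda_{\max}(A)$.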




also see nn demo playground

```python
from numpy import exp, array, random

# toy data: 4 examples with 3 features; the target equals the first feature
X = array([[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]])
Y = array([[0, 1, 1, 0]]).T
w = 2 * random.random((3, 1)) - 1                 # random init in [-1, 1)
for iteration in range(10000):
    Yhat = 1 / (1 + exp(-(X @ w)))                # sigmoid forward pass
    w += X.T @ ((Y - Yhat) * Yhat * (1 - Yhat))   # gradient step (squared-error loss)
print(1 / (1 + exp(-(array([1, 0, 0]) @ w))))     # prediction for a new example
```
For larger networks, use a framework: `import tensorflow as tf` or `import torch`.
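As a sketch, the same one-layer sigmoid network written with PyTorch autograd (learning rate and iteration count are arbitrary choices here):

```python
import torch

# same toy problem as the numpy version, but with autograd
X = torch.tensor([[0., 0., 1.], [1., 1., 1.], [1., 0., 1.], [0., 1., 1.]])
Y = torch.tensor([[0.], [1.], [1.], [0.]])
w = torch.randn(3, 1, requires_grad=True)
opt = torch.optim.SGD([w], lr=1.0)
for _ in range(10000):
    Yhat = torch.sigmoid(X @ w)
    loss = ((Y - Yhat) ** 2).mean()     # squared-error loss, as in the numpy version
    opt.zero_grad()
    loss.backward()
    opt.step()
print(torch.sigmoid(torch.tensor([1., 0., 0.]) @ w).item())   # prediction for a new example
```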

















EM: we want to maximize the complete-data log-likelihood, but the latent variables are unobserved, so we iteratively maximize its expected value under the current posterior over the latents (E-step: compute the expectation; M-step: maximize over the parameters)
note 20 is good
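As an illustration of the E/M alternation, a minimal sketch of EM for a two-component 1-D Gaussian mixture (synthetic data; initialization and iteration count chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

w1, mu, var = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])   # mixing weight of comp. 1, means, variances
for _ in range(100):
    # E-step: responsibility of component 1 for each point
    p0 = (1 - w1) * np.exp(-(x - mu[0])**2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
    p1 = w1 * np.exp(-(x - mu[1])**2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
    r = p1 / (p0 + p1)
    # M-step: re-estimate parameters from the responsibilities
    w1 = r.mean()
    mu = np.array([((1 - r) * x).sum() / (1 - r).sum(),
                   (r * x).sum() / r.sum()])
    var = np.array([((1 - r) * (x - mu[0])**2).sum() / (1 - r).sum(),
                    (r * (x - mu[1])**2).sum() / r.sum()])
print(w1.round(2), mu.round(2), var.round(2))
```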


decision boundary: $\{x : w^T x + b = 0\}$


can rewrite by absorbing the bias $b$ into $w$ (append a constant feature $1$ to each $x$), so the boundary becomes $\{x : w^T x = 0\}$

With labels $y \in \{-1, +1\}$ and margin $y\, w^T x$:

| Model | Loss |
|---|---|
| Perceptron | $\max(0,\; -y\, w^T x)$ |
| Linear SVM | $\max(0,\; 1 - y\, w^T x)$ (hinge), usually plus $\lambda\lVert w\rVert_2^2$ |
| Logistic regression | $\log\!\left(1 + e^{-y\, w^T x}\right)$ |
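A small sketch comparing these surrogate losses as functions of the margin $z = y\, w^T x$ (the grid of margin values is just for illustration):

```python
import numpy as np

z = np.linspace(-2, 2, 5)                     # margin y * w^T x
perceptron = np.maximum(0, -z)                # penalize only misclassified points
hinge = np.maximum(0, 1 - z)                  # SVM: also penalize small positive margins
logistic = np.log(1 + np.exp(-z))             # smooth surrogate
print(np.column_stack([z, perceptron, hinge, logistic]).round(2))
```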
primal:

$$p^* = \min_x\; f_0(x) \quad \text{s.t.}\quad f_i(x) \le 0,\; h_i(x) = 0$$

dual:

$$d^* = \max_{\lambda \succeq 0,\, \nu}\; \underbrace{\inf_x\; \underbrace{f_0(x) + \textstyle\sum_i \lambda_i f_i(x) + \sum_i \nu_i h_i(x)}_{\text{Lagrangian } L(x,\lambda,\nu)}}_{\text{dual function } g(\lambda,\nu)}$$
weak duality: $d^* \le p^*$ (always holds)
strong duality: $d^* = p^*$ (holds for convex problems under constraint qualifications such as Slater's condition)
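A tiny worked example of these definitions: for $\min_x x^2$ s.t. $1 - x \le 0$,

$$L(x, \lambda) = x^2 + \lambda(1 - x), \qquad g(\lambda) = \inf_x L(x, \lambda) = \lambda - \tfrac{\lambda^2}{4},$$

which is maximized at $\lambda = 2$, giving $d^* = 1 = p^*$ (strong duality holds: the problem is convex and strictly feasible).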

lasso (constrained form):

$$\min_w \|Xw - y\|_2^2 \quad \text{s.t.}\quad \|w\|_1 \le k$$

ridge (constrained form):

$$\min_w \|Xw - y\|_2^2 \quad \text{s.t.}\quad \|w\|_2 \le k$$


choose the split that maximizes the information gain: $H(\text{parent})$ minus the weighted average of the children's entropies
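A small sketch of this criterion for a binary split (the labels, split, and helper names `entropy` / `information_gain` are illustrative):

```python
import numpy as np

def entropy(y):
    """Shannon entropy (bits) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(y_parent, y_left, y_right):
    """H(parent) minus the size-weighted average entropy of the children."""
    n = len(y_parent)
    weighted = (len(y_left) / n) * entropy(y_left) + (len(y_right) / n) * entropy(y_right)
    return entropy(y_parent) - weighted

y = np.array([0, 0, 1, 1, 1, 0, 1, 1])
print(information_gain(y, y[:4], y[4:]))
```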




boosting: sequentially train many weak learners, each fit to correct the errors of the current ensemble, and combine them to approximate a function
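For concreteness, a minimal AdaBoost-style sketch with decision stumps as the weak learners (assumes labels in $\{-1, +1\}$; the helper names `fit_stump`, `adaboost`, and `predict` are just for illustration):

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, sign) stump with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - t + 1e-12)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, rounds=20):
    n = len(y)
    w = np.full(n, 1 / n)                                 # example weights
    stumps = []
    for _ in range(rounds):
        err, j, t, s = fit_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # weight of this weak learner
        pred = s * np.sign(X[:, j] - t + 1e-12)
        w *= np.exp(-alpha * y * pred)                    # upweight the mistakes
        w /= w.sum()
        stumps.append((alpha, j, t, s))
    return stumps

def predict(stumps, X):
    score = sum(a * s * np.sign(X[:, j] - t + 1e-12) for a, j, t, s in stumps)
    return np.sign(score)
```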