Chandan Singh | dl theory

dl theory

Deep learning theory is a complex emerging field - this post contains links displaying some different interesting research directions

off convex blog

# theoretical studies

DNNs display many surprising properties

Many things seem to contribute to the inductive bias of DNNs: SGD, dropout, early stopping, resnets, convolution, more layers…all of these are tangled together and many things correlate with generalization error…what are the important things and how do they contribute?

some more concrete questions:

• what is happening when training err stops going down but val err keeps going down (interpolation regime)?
• what are good statistical markers of an effectively trained DNN?
• how far apart are 2 nets?

## functional approximation

• dnns are very hard to study in the parameter space (e.g. swapping two parameters changes things like the Hessian), easier to to study in the function space (e.g. the input-output relationship)
• nonlinear approximation (e.g. sparse coding) - 2 steps
1. construct a dictionary function (T)
1. learn linear combination of the dictionary elements (g)
• background
•  $L^2$ function (or function space) is square integrable: $f ^2 = \int_X f ^2 d\mu$, and $f$ is its $L_2$-norm
• Hilbert space - vector space w/ additional structure of inner product which allows length + angle to be measured
• complete - there are enough limits in the space to allow calculus techniques (is a complete metric space)
• composition allows an approximation of a function through level sets (split it up and approximate on these sets) - Zuowei Shen (talk, slides)
• composition operation allows an approximation of a function f through level sets of f –
• one divides up the range of f into equal intervals and approximate the functions on these sets
• nonlinear approximation via compositions (shen 2019)
• how do the weights in each layer help this approximation to be more effective?
• here are some thoughts – If other layers are like the first layer, the weights “whiten” or make the inputs more independent or random projections – that is basically finding PC directions for low-rank inputs.
• are the outputs from later layers more or less low - rank?
• I wonder how this “whitening” helps level set estimation…
• takagi functions
• nonlinear approximation and (deep) relu nets - also comes with slides

## inductive bias=implicit regularization: gradient descent finds good minima

DL learns solutions that generalize even though it can find many which don’t due to its inductive bias.

• gd bousquet paper
• in high dims, local minima are usually saddles (ganguli)
• early stopping is very similar to ridge regression
• ridge regression: soln will lie in $p-dim$ row-space of X (span of the rows)
• when $\lambda \to 0$, will give us min-norm soln (basically because we project onto $col(X)$)
• early stopping in least squares
• if we initialize at 0, GD soln will always be in the row-space of X
• GD will converge to min-norm soln (any soln not in the row-space will necessarily have larger norm)
• similar to ridge (endpoints are the same) and can related their risks very closely when $\lambda = 1/t$, where $t$ is GD time iterate
• assume gradient flow: take step-size to 0
• srebro understanding over-parameterization
• ex. gunasekar et al 2017: unconstrained matrix completion
• grad descent on U, V yields min nuclear norm solution
• ex. soudry et al 2017
• sgd on logistic reg. gives hard margin svm
• deep linear net gives the same thing - doesn’t actually changed anything
• linear convnets give smth better - minimum l1 norm in discrete fourier transform
• ex. savarese 2019
• infinite width relu net 1-d input
• weight decay minimization minimizes derivative of TV
• implicit bias towards simpler models
• How do infinite width bounded norm networks look in function space? (savarese…srebro 2019)
• minimal norm fit for a sample is given by a linear spline interpolation (2 layer net)
• analytic theory of generalization + transfer (ganguli 19)
• datasets for measuring causality
• Inferring Hidden Statuses and Actions in Video by Causal Reasoning - about finding causality in the video, not interpretation
• PL condition = Polyak-Lojawsiewicz condition guarantees global convergence of loca methods
•  $\nabla f(x) ^2 \geq \alpha f(x) \geq 0$

## implicit dl + optimization

• implicit deep learning (el ghaoui et al. 2019)
• $\hat y (u) = Cx + D u$, where $x = \phi(Ax + Bu)$
• here, $u$ is a new input
• $x$ is a hidden state which represents some hidden features (which depends on $u$)
• well-posedness - want x to be unique for a given u
• $A, B$ are matrices which let us compute $x$ given $u$
• $C, D$ help us do the final prediction (like the final linear layer)
• ex. feedforward nets
• consider net with $L$ layers
• $x_0 = u$
• $x_{l + 1} = \phi_l (W_l x_l)$
• $\hat y (u) = W_L x_L$
• rewriting in implicit form
• $x = (x_L, …, x_1)$ - concatenate all the activations into one big vector
• ex. $Ax + Bu= \begin{bmatrix} W_{L-1}x_{L-1} \ W_{L-2} x_{L-2} \ \vdots \ W_1x_1 \ \mathbf 0\end{bmatrix} + \begin{bmatrix} 0 \ 0 \ \vdots \ 0 \ W_0 u \end{bmatrix}$
• lifted neural networks (askari et al. 2018)
• can solve dnn $\hat y = \phi(W_2 \phi (W_1X_0))$ by rewriting using constraints:
• $X_1 = \phi(W_1 X_0)$
• $X_2 = \phi(W_2 X_1)$
• \begin{align} &\min (y - \hat y)^2\s.t. X_1 &= \phi(WX_0)\X_2 &= \phi(WX_1)\end{align}
• can be written using Lagrangian multipliers: $\min (y - \hat y)^2 + \lambda_1( X_1 - \phi(WX_0)) + \lambda_2(X_2 - \phi(WX_1))$
• Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training (gu et al. 2018)
• in the lifted setting above, can replace Lagrangian with simpler expression using Fenchel conjugates
• robust optimization basics
• immunize optimization problems against uncertainty in the data
• do so by having worst-case constraints (e.g. $a < 5, \forall a$)
• local robustness - maximize radius surrounding parameter subject to all constraints (no objective to maximize)
• global robustness - maximize objective subject to robustness constraint (trades off robustness with objective value)
• non-probabilistic robust optimization models (e.g. Wald’s maximin model: $\underset{x}{\max} \underset{u}{\min} f(x, u)$
• also are probabilistically robust optimization
• distributional robustness - using moments in the dl work
• ch 4 and ch10 of robust optimization book (bental, el ghaoui, & nemirovski 2009)
• Certifying Some Distributional Robustness with Principled Adversarial Training (sinha, namkoong, & duchi 2018)
• want to guarantee performance under adversarial input perturbations
• considering a Lagrangian penalty formulation of perturbing the underlying data distribution in a Wasserstein ball
• during training, augments model parameter updates with worst-case perturbations of training data
• little extra cost and achieves guarantees for smooth losses
• On Distributionally Robust Chance-Constrained Linear Programs (calafiore & el ghaoui 2006)
• linear programs where data (in the constraints) is random
• want to enforce the constraints up to a given prob. level
• can convert the prob. constraints into convex 2nd-order cone constraints
• under distrs. for the random data, can guarantee constraints
• Differentiable Convex Optimization Layers (agrawal et al. 2019)

## statistical physics

• Statistical Mechanics of Deep Learning (bahri et al. 2019)
• what is the advantage of depth - connect to dynamical phase transitions
• there are several function which require only polynomial nodes in each layer for deep nets, but exponential for shallow nets
• what is the shape of the loss landscape - connect to random Gaussian processes, sping glasses, and jamming
• how to pick a good parameter initialization?
• bounding generalization error
• often, generalization bounds take the form $\epsilon_{test} \leq \epsilon_{train} + \frac {\mathcal C (\mathcal F)} p$, where $\mathcal C (\mathcal F)$ is the complexity of a function class and $p$ is the number of examples
• alternative framework: algorithmic stability - will generalize if map is stable wrt perturbations of the data $\mathcal D$
• altenative: PAC bounds suggest if distr. of weights doesn’t change much during training, generalization will be succesful
• deep linear networks
• student learns biggest singular values of the input-output correlation matrix $\Sigma = \sum_i y_i x_i^T$, so it learns the important stuff first and the noise last
• infinite-width limit
• if parameters are random, indcues a prior distribution $P(\mathcal F)$ over the space of functions
• in th limit of infinite width, this prior is Gaussian, with a specific correlation kernel
•  learning is similar to learning the Bayesian posterior $P(f data)$, but connecting this to sgd is still not clear

# empirical studies

## tools for analyzing

• dim reduction: svcca, diffusion maps
• viz tools: bei wang
• visualizing loss landscape
• 1d: plot loss by extrapolating between 2 points (start/end, 2 ends)
• goodfellow et al. 2015, im et al. 2016
• exploring landscape with this technique
• 2d: plot loss on a grid in 2 directions
• important to think about scale invariance (dinh et al. 2017)
• want to scale direction vector to have same norm in each direction as filter
• use PCA to find important directions (ex. sample w at each step, pca to find most important directions of variance)

## memorization background

• memorization on single training example
• memorization in dnns
• “memorization” as the behavior exhibited by DNNs trained on noise, and conduct a series of experiments that contrast the learning dynamics of DNNs on real vs. noise data
• look at critical samples - adversarial exists nearby
• networks that generalize well have deep layers that are approximately linear with respect to batches of similar inputs
• networks that memorize their training data are highly non-linear with respect to similar inputs, even in deep layers
• expect that with respect to a single class, deep layers are approximately linear
• example forgetting paper
• secret sharing
• memorization in overparameterized autoencoders
• autoencoders don’t lean identity, but learn projection onto span of training examples = memorization of training examples
• sometimes output individual training images, not just project onto space of training images
• https://arxiv.org/pdf/1810.10333.pdf
• https://arxiv.org/pdf/1909.12362.pdf
• https://pdfs.semanticscholar.org/a624/6278cb5e2d0ab79fe20fe20a41c586732a11.pdf
• Auto-encoders: reconstruction versus compression

## architecture search background

• ideas: nested search (retrain each time), joint search (make arch search differentiable), one-shot search (train big net then search for subnet)

# basics

• good set of class notes
• demos to gain intuition
• colah
• overview / reviews

• some people involved: nathan srebro, sanjeev arora, jascha sohl-dickstein, tomaso poggio, stefano soatto, ben recht, olivier bousquet, jason lee, simon shaolei du
• interpretability: cynthia rudin, rich caruana, been kim, nicholas papernot, finale doshi-velez
• neuro: eero simoncelli, haim sompolinsky