Deep learning theory is a complex, emerging field; this post collects links to some interesting research directions.

# 1.6. dl theory

off convex blog

## 1.6.1. theoretical studies

DNNs display many surprising properties

Many things seem to contribute to the inductive bias of DNNs: SGD, dropout, early stopping, resnets, convolution, more layers…all of these are tangled together, and many things correlate with generalization error…what are the important things and how do they contribute?

some more concrete questions:

• what is happening when training err stops going down but val err keeps going down (interpolation regime)?

• what are good statistical markers of an effectively trained DNN?

• how far apart are 2 nets?

### 1.6.1.1. functional approximation

• dnns are very hard to study in the parameter space (e.g. swapping two parameters changes things like the Hessian), easier to study in the function space (e.g. the input-output relationship)

• nonlinear approximation (e.g. sparse coding) - 2 steps

1. construct a dictionary function (T)

2. learn linear combination of the dictionary elements (g)

• background

• $$L^2$$ function (or function space) is square integrable: $$||f||_2^2 = \int_X |f|^2 \, d\mu < \infty$$, and $$||f||_2$$ is its $$L^2$$-norm

• Hilbert space - vector space w/ additional structure of inner product which allows length + angle to be measured

• complete - there are enough limits in the space to allow calculus techniques (is a complete metric space)
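• background note: the $$L^2$$ norm above is induced by an inner product, which is exactly the extra structure that makes $$L^2$$ a Hilbert space: $$\langle f, g \rangle = \int_X f g \, d\mu$$, so $$||f||_2 = \sqrt{\langle f, f \rangle}$$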

• composition allows an approximation of a function through level sets (split it up and approximate on these sets) - Zuowei Shen (talk, slides)

• composition operation allows an approximation of a function f through level sets of f:

• divide the range of f into equal intervals and approximate the function on the corresponding level sets
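• a minimal numpy sketch of this idea (the target function here is a hypothetical stand-in; f is approximated by a piecewise-constant function built from K equal range intervals, i.e. its level sets):

```python
import numpy as np

# approximate a function by quantizing its range into K equal intervals
# (each interval corresponds to one level set of f)
f = lambda x: np.sin(2 * np.pi * x)            # hypothetical target function
K = 16                                          # number of range intervals
x = np.linspace(0, 1, 1000)
lo, hi = f(x).min(), f(x).max()
idx = np.clip(((f(x) - lo) / (hi - lo) * K).astype(int), 0, K - 1)
f_hat = lo + (idx + 0.5) * (hi - lo) / K        # piecewise-constant approximation
print(np.abs(f(x) - f_hat).max())               # error <= (hi - lo) / (2K)
```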

• nonlinear approximation via compositions (shen 2019)

• how do the weights in each layer help this approximation to be more effective?

• here are some thoughts: if other layers are like the first layer, the weights "whiten" (make the inputs more independent / random projections) - that is basically finding PC directions for low-rank inputs.

• are the outputs from later layers more or less low-rank?

• I wonder how this "whitening" helps level set estimation…

• takagi functions

• nonlinear approximation and (deep) relu nets - also comes with slides

### 1.6.1.2. inductive bias = implicit regularization: gradient descent finds good minima

Due to its inductive bias, DL learns solutions that generalize even though it could find many solutions which don't.

• gd bousquet paper

• in high dims, most critical points are saddles rather than local minima (ganguli)

• early stopping is very similar to ridge regression

• ridge regression: soln will lie in the $$p$$-dim row-space of $$X$$ (span of the rows)

• when $$\lambda \to 0$$, will give us min-norm soln (basically because we project onto $$col(X)$$)

• early stopping in least squares

• if we initialize at 0, GD soln will always be in the row-space of X

• GD will converge to min-norm soln (any soln not in the row-space will necessarily have larger norm)

• similar to ridge (endpoints are the same), and their risks can be related very closely when $$\lambda = 1/t$$, where $$t$$ is the GD time iterate

• assume gradient flow: take step-size to 0
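• a tiny numpy check of the min-norm claim (a sketch with a finite step size rather than gradient flow; the sizes and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))            # underdetermined: 5 samples, 20 features
y = rng.normal(size=5)

w = np.zeros(20)                        # zero init keeps iterates in row-space of X
lr = 0.01
for _ in range(20000):                  # long training ~ the lambda -> 0 endpoint
    w -= lr * X.T @ (X @ w - y)         # gradient of 0.5 * ||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y      # min-norm least-squares solution
print(np.allclose(w, w_min_norm, atol=1e-6))   # True: GD found the min-norm soln
```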

• srebro understanding over-parameterization

• ex. gunasekar et al 2017: unconstrained matrix completion

• grad descent on U, V yields min nuclear norm solution

• sgd on logistic reg. gives hard margin svm

• deep linear net gives the same thing - doesn't actually change anything

• linear convnets give something better - minimum $$\ell_1$$ norm in the discrete fourier transform domain

• ex. savarese 2019

• infinite width relu net 1-d input

• minimizing the weight norm (weight decay) corresponds to minimizing the total variation of the derivative

• implicit bias towards simpler models

• How do infinite width bounded norm networks look in function space? (savarese…srebro 2019)

• minimal norm fit for a sample is given by a linear spline interpolation (2 layer net)
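• for 1-d data this minimal-norm function is just the connect-the-dots interpolant, which is what np.interp evaluates (a toy illustration of the function-space picture, not the paper's code):

```python
import numpy as np

x_train = np.array([0.0, 1.0, 2.0, 3.5])
y_train = np.array([0.0, 1.0, 0.5, 2.0])
# np.interp evaluates exactly the connect-the-dots linear spline through the samples
print(np.interp(np.linspace(0, 3.5, 8), x_train, y_train))
```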

• analytic theory of generalization + transfer (ganguli 19)

• datasets for measuring causality

• Inferring Hidden Statuses and Actions in Video by Causal Reasoning - about finding causality in the video, not interpretation

• PL condition = Polyak-Łojasiewicz condition: guarantees global convergence of local methods

• $$||\nabla f(x)||^2 \geq \alpha f(x) \geq 0$$
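• a standard one-line derivation of why PL gives global convergence for gradient descent (assuming $$f$$ is $$L$$-smooth, $$\min f = 0$$, and step size $$1/L$$; the first inequality is smoothness, the second is PL):

\begin{align} f(x_{t+1}) &\leq f(x_t) - \frac{1}{2L}||\nabla f(x_t)||^2 \leq \left(1 - \frac{\alpha}{2L}\right)f(x_t) \implies f(x_t) \leq \left(1 - \frac{\alpha}{2L}\right)^t f(x_0)\end{align}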

### 1.6.1.7. kernels

• To understand deep learning we need to understand kernel learning - interpolating (overfitted) kernel classifiers can still generalize well

• original kernels (neal 1994) + (lee et al. 2018) + (matthews et al. 2018)

• infinitely wide nets and only top layer is trained

• corresponds to kernel $$\text{ker}(x, x') = \mathbb E_{\theta \sim W}[f(\theta, x) \cdot f(\theta, x')]$$, where $$W$$ is an initialization distribution over $$\theta$$
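• a minimal Monte Carlo sketch of this kernel for a toy 2-layer relu net with random gaussian weights (the architecture and scalings here are illustrative assumptions, not from the papers):

```python
import numpy as np

def nngp_kernel_mc(x1, x2, width=2048, n_samples=200, seed=0):
    """Monte Carlo estimate of ker(x, x') = E_theta[f(theta, x) * f(theta, x')]
    for a toy 2-layer ReLU net with random Gaussian weights."""
    rng = np.random.default_rng(seed)
    d, vals = x1.shape[0], []
    for _ in range(n_samples):
        W = rng.normal(size=(width, d)) / np.sqrt(d)   # random first layer
        a = rng.normal(size=width) / np.sqrt(width)    # random readout layer
        vals.append((a @ np.maximum(W @ x1, 0)) * (a @ np.maximum(W @ x2, 0)))
    return np.mean(vals)

print(nngp_kernel_mc(np.array([1.0, 0.0]), np.array([0.6, 0.8])))
```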

• neural tangent kernel (jacot et al. 2018)

• $$\text{ker}(x, x') = \mathbb E_{\theta \sim W}\left[\left \langle \frac{\partial f(\theta, x)}{\partial \theta}, \frac{\partial f(\theta, x')}{\partial \theta} \right\rangle \right]$$ - the evolution of the network function during training follows this kernel

• with very large width, this kernel converges to a deterministic kernel at initialization

• stays stable during training (since weights donâ€™t change much)

• at initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit

• evolution of an ANN during training can also be described by a kernel (kernel gradient descent)

• different types of kernels impose different things on a function (e.g. want more / less low frequencies)

• gradient descent in kernel space can be convex if kernel is PD (even if nonconvex in the parameter space)
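• a sketch of computing the empirical (finite-width) NTK defined above via autograd; the toy scalar-output net and its scalings are assumptions for illustration:

```python
import torch

def empirical_ntk(f, params, x1, x2):
    """ker(x1, x2) = <df(theta, x1)/dtheta, df(theta, x2)/dtheta> at current theta."""
    def grad_vec(x):
        grads = torch.autograd.grad(f(x), params)     # gradient w.r.t. all params
        return torch.cat([g.reshape(-1) for g in grads])
    return grad_vec(x1) @ grad_vec(x2)

width = 512                                           # larger width -> closer to the limit
w1 = torch.randn(width, 1, requires_grad=True)
w2 = (torch.randn(1, width) / width**0.5).requires_grad_()
f = lambda x: (w2 @ torch.relu(w1 @ x)).squeeze()     # scalar-output toy net
print(empirical_ntk(f, [w1, w2], torch.tensor([[1.0]]), torch.tensor([[0.5]])))
```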

• understanding the neural tangent kernel (arora et al. 2019)

• method to compute the kernel quickly on a gpu

• Scaling description of generalization with number of parameters in deep learning (geiger et al. 2019)

• number of params = N

• past the point of 0 training err, a larger number of params reduces the variance of the learned function but doesn't otherwise help

• ensembling with smaller N fixes problem

• the improvement of generalization performance with N in this classification task originates from reduced variance of $$f_N$$ when N gets large, as recently observed for mean-square regression

• On the Inductive Bias of Neural Tangent Kernels (bietti & mairal 2019)

• Kernel and Deep Regimes in Overparametrized Models (Woodworthâ€¦Srebro 2019)

• transition between kernel and deep regimes

• The HSIC Bottleneck: Deep Learning without Back-Propagation (Ma et al. 2019)

• directly optimizing the information bottleneck (approximated by HSIC) yields pretty good results

### 1.6.1.9. implicit dl + optimization

• implicit deep learning (el ghaoui et al. 2019)

• $$\hat y (u) = Cx + D u$$, where $$x = \phi(Ax + Bu)$$

• here, $$u$$ is a new input

• $$x$$ is a hidden state which represents some hidden features (which depends on $$u$$)

• well-posedness - want x to be unique for a given u

• $$A, B$$ are matrices which let us compute $$x$$ given $$u$$

• $$C, D$$ help us do the final prediction (like the final linear layer)

• ex. feedforward nets

• consider net with $$L$$ layers

• $$x_0 = u$$

• $$x_{l + 1} = \phi_l (W_l x_l)$$

• $$\hat y (u) = W_L x_L$$

• rewriting in implicit form

• $$x = (x_L, ..., x_1)$$ - concatenate all the activations into one big vector

• ex. $$Ax + Bu= \begin{bmatrix} W_{L-1}x_{L-1} \\ W_{L-2} x_{L-2} \\ \vdots \\ W_1x_1 \\ \mathbf 0\end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ W_0 u \end{bmatrix}$$
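• a minimal sketch of the implicit forward pass: solve the fixed point $$x = \phi(Ax + Bu)$$ by iteration (assuming well-posedness, e.g. $$||A|| < 1$$ with 1-Lipschitz $$\phi$$), then predict; all sizes and scalings are illustrative:

```python
import numpy as np

def implicit_forward(A, B, C, D, u, n_iter=100):
    """Solve x = phi(Ax + Bu) by fixed-point iteration, then predict y = Cx + Du."""
    phi = lambda z: np.maximum(z, 0)                # ReLU, 1-Lipschitz
    x = np.zeros(A.shape[0])
    for _ in range(n_iter):
        x = phi(A @ x + B @ u)
    return C @ x + D @ u

rng = np.random.default_rng(0)
n, p = 8, 3
A = 0.3 * rng.normal(size=(n, n)) / np.sqrt(n)      # scaled so the map contracts
B, C, D = rng.normal(size=(n, p)), rng.normal(size=(1, n)), rng.normal(size=(1, p))
print(implicit_forward(A, B, C, D, rng.normal(size=p)))
```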

• lifted neural networks (askari et al. 2018)

• can solve dnn $$\hat y = \phi(W_2 \phi (W_1X_0))$$ by rewriting using constraints:

• $$X_1 = \phi(W_1 X_0)$$

• $$X_2 = \phi(W_2 X_1)$$

• \begin{align} &\min (y - \hat y)^2\\ \text{s.t. } X_1 &= \phi(W_1X_0)\\X_2 &= \phi(W_2X_1)\end{align}

• can be written using Lagrange multipliers: $$\min (y - \hat y)^2 + \lambda_1( X_1 - \phi(W_1X_0)) + \lambda_2(X_2 - \phi(W_2X_1))$$

• in the lifted setting above, can replace Lagrangian with simpler expression using Fenchel conjugates

• robust optimization basics

• immunize optimization problems against uncertainty in the data

• do so by having worst-case constraints (e.g. a constraint must hold $$\forall a$$ in some uncertainty set)

• local robustness - maximize radius surrounding parameter subject to all constraints (no objective to maximize)

• global robustness - maximize objective subject to robustness constraint (trades off robustness with objective value)

• non-probabilistic robust optimization models (e.g. Wald's maximin model: $$\underset{x}{\max} \underset{u}{\min} f(x, u)$$)

• there are also probabilistically robust optimization models

• distributional robustness - using moments in the dl work

• ch 4 and ch 10 of the robust optimization book (ben-tal, el ghaoui, & nemirovski 2009)

• Certifying Some Distributional Robustness with Principled Adversarial Training (sinha, namkoong, & duchi 2018)

• want to guarantee performance under adversarial input perturbations

• considers a Lagrangian penalty formulation of perturbing the underlying data distribution within a Wasserstein ball

• during training, augments model parameter updates with worst-case perturbations of training data

• little extra cost and achieves guarantees for smooth losses
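• a sketch of the inner maximization under a Lagrangian penalty (illustrative step size and loop count, not the paper's exact procedure):

```python
import torch

def worst_case_perturbation(model, loss_fn, x, y, gamma=1.0, steps=15, lr=0.1):
    """Inner maximization of loss(model(x + delta), y) - gamma * ||delta||^2
    by gradient ascent on delta (model grads should be zeroed before the outer step)."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        obj = loss_fn(model(x + delta), y) - gamma * delta.pow(2).sum()
        opt.zero_grad()
        (-obj).backward()              # negate: SGD minimizes, we want ascent
        opt.step()
    return delta.detach()
```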

• On Distributionally Robust Chance-Constrained Linear Programs (calafiore & el ghaoui 2006)

• linear programs where data (in the constraints) is random

• want to enforce the constraints up to a given prob. level

• can convert the prob. constraints into convex 2nd-order cone constraints

• under distrs. for the random data, can guarantee constraints

• Differentiable Convex Optimization Layers (agrawal et al. 2019)

### 1.6.1.10. statistical physics

• Statistical Mechanics of Deep Learning (bahri et al. 2019)

• what is the advantage of depth - connect to dynamical phase transitions

• there are several functions which require only polynomially many nodes per layer for deep nets, but exponentially many for shallow nets

• what is the shape of the loss landscape - connect to random Gaussian processes, spin glasses, and jamming

• how to pick a good parameter initialization?

• bounding generalization error

• often, generalization bounds take the form $$\epsilon_{test} \leq \epsilon_{train} + \frac {\mathcal C (\mathcal F)} p$$, where $$\mathcal C (\mathcal F)$$ is the complexity of a function class and $$p$$ is the number of examples

• alternative framework: algorithmic stability - will generalize if map is stable wrt perturbations of the data $$\mathcal D$$

• alternative: PAC-Bayes bounds suggest that if the distr. of weights doesn't change much during training, generalization will be successful

• deep linear networks

• student learns biggest singular values of the input-output correlation matrix $$\Sigma = \sum_i y_i x_i^T$$, so it learns the important stuff first and the noise last

• infinite-width limit

• if parameters are random, this induces a prior distribution $$P(\mathcal F)$$ over the space of functions

• in the limit of infinite width, this prior is Gaussian, with a specific correlation kernel

• learning is similar to learning the Bayesian posterior $$P(f|data)$$, but connecting this to sgd is still not clear

• instead, symmetries embedded in a network's architecture constrain the training dynamics

• similar to Noether's thm in physics

• can much better analytically describe learning dynamics

• weights have a differentiable symmetry in the loss if the loss doesn't change under a certain differentiable transformation of the weights

• ex. translation symmetry - for the layer before softmax, weight changes that shift all outputs by a constant leave the loss unchanged (see the numerical check after this list)

• ex. scale symmetry - inputs to batch normalization are invariant to scaling

• ex. rescale symmetry - scaling up/down 2 things which multiply or add

• modeling discretization
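• a quick numerical check of the translation symmetry above: shifting all pre-softmax logits by a constant leaves the softmax output (and hence the loss) unchanged:

```python
import numpy as np

softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
z = np.random.default_rng(0).normal(size=5)        # pre-softmax logits
print(np.allclose(softmax(z), softmax(z + 3.7)))   # True: shifting all logits is a symmetry
```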

## 1.6.2. empirical studies

### 1.6.2.1. interesting empirical papers

• robustness may be at odds with accuracy (madry 2019)

• adversarial training helps w/ little data but hurts with lots of data

• adversarially trained models have more meaningful gradients (and their adversarial examples actually look like other classes)

• Generalizability vs. Robustness: Adversarial Examples for Medical Imaging

• Towards Robust Interpretability with Self-Explaining Neural Networks

• Adversarial examples are inputs to machine learning models that an attacker intentionally designed to cause the model to make mistakes

• threat models

• poisoning attack (insert fake samples into training data) vs. evasion attack (just evade at test time)

• targeted attack (want specific class) vs. non-targeted attack (just change the prediction)

• white-box - adversary knows everything

• black-box - can only feed inputs and get outputs

• gray-box - might have white box for limited amount of time

• security evaluation

• robustness - minimum norm perturbation to change class

• adversarial loss - biggest change in loss within some epsilon ball

• AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty (hendrycks et al. 2020)

• apply a bunch of transformations and mix (weighted-average) the resulting images to create each training image
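• a simplified sketch of the mixing step (the real AugMix composes chains of ops and adds a consistency loss; here augs is a stand-in list of transformations):

```python
import numpy as np

def augmix_like(x, augs, rng, width=3, alpha=1.0):
    """Dirichlet-weighted convex combination of augmented copies, then a final
    Beta-weighted mix with the original image (simplified from the paper)."""
    w = rng.dirichlet([alpha] * width)
    mixed = sum(wi * augs[rng.integers(len(augs))](x) for wi in w)
    m = rng.beta(alpha, alpha)
    return m * x + (1 - m) * mixed

x = np.random.default_rng(1).random((32, 32))       # toy grayscale image
augs = [np.flipud, np.fliplr, np.rot90]             # stand-in augmentations
print(augmix_like(x, augs, np.random.default_rng(2)).shape)
```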

• certifying robustness

• bound gradients of f around x - solve SDP for 2-layer net (raghunathan, steinhardt, & liang 2018, iclr)

• relax the SDP so it applies to multilayer nets (raghunathan et al. 2018)

• improved optimizer for the SDP (dathathri et al. 2020)

• interval bound propagation - instead of passing a point through the net, pass an interval where it could be (see the sketch after these bullets)

• works well for word substitutions (jia et al. 2019)

• RobEn - cluster words that are confusable + give them the same encoding when you train (jones et al. 2020)

• then you are robust to confusing the words
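• a sketch of interval bound propagation through one linear + relu layer, using the standard center/radius form (toy weights, not from any specific paper's code):

```python
import numpy as np

def ibp_linear_relu(W, b, lower, upper):
    """Propagate elementwise bounds [lower, upper] through y = relu(Wx + b)
    using the center/radius form: mu = (l+u)/2, r = (u-l)/2."""
    mu, r = (lower + upper) / 2, (upper - lower) / 2
    center, radius = W @ mu + b, np.abs(W) @ r      # worst case over the input box
    return np.maximum(center - radius, 0), np.maximum(center + radius, 0)

W, b = np.array([[1.0, -2.0], [0.5, 1.0]]), np.zeros(2)
l, u = np.array([-0.1, -0.1]), np.array([0.1, 0.1])
print(ibp_linear_relu(W, b, l, u))                  # sound bounds on the layer output
```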

• unlabeled data + self-training helps

• training robust models has higher sample complexity than training standard models

• tradeoff between robustness and accuracy (raghunathan et al. 2020)

• adding valid data can hurt even in well-specified, convex setting

• robust self-training eliminates this tradeoff in linear regression

• sample complexity can be reduced with unlabeled examples (carmon et al. 2019)

• distributional robust optimization

• instead of minimizing training err, minimize maximum training err over different perturbations

• hard to pick the perturbation set - can easily be too pessimistic

• these things possibly magnify disparities

• larger models

• selective classification

• feature noise

• removing spurious features

• group DRO (sagawa et al. 2020) - minimize the worst-group (maximum) error

• need to add regularization to keep errors from going to 0

• training overparameterized models makes this problem worse

• abstain from classifying based on confidence (jones et al. 2020)

• makes error rates worse for worst group = selective classification can magnify disparities

• adding feature noise to each group can magnify disparities

• standard domain adaptation: labeled (x, y) in source and unlabeled x in target

• gradual domain adaptation - things change slowly over time, can use gradual self-training (kumar et al. 2020)

• In-N-Out (xie et al. 2020) - if we have many features, rather than using them all as inputs, use some as inputs and some as auxiliary targets to learn about the domain shift

### 1.6.2.3. tools for analyzing

• dim reduction: svcca, diffusion maps

• viz tools: bei wang

• visualizing loss landscape

• 1d: plot loss by interpolating between 2 points (start/end of training, or 2 trained endpoints)

• goodfellow et al. 2015, im et al. 2016

• exploring landscape with this technique
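• a minimal sketch of the 1d technique (a toy quadratic loss stands in for a real network's loss):

```python
import numpy as np

def loss_along_line(loss_fn, theta_a, theta_b, n=25):
    """Evaluate the loss on the segment between two parameter vectors
    (e.g. initial vs. trained weights, or two trained solutions)."""
    alphas = np.linspace(0.0, 1.0, n)
    return alphas, [loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas]

toy_loss = lambda th: np.sum((th - 1.0) ** 2)       # stand-in for a network's loss
alphas, curve = loss_along_line(toy_loss, np.zeros(10), np.full(10, 2.0))
print(min(curve))                                   # dips to 0 at the midpoint here
```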

• 2d: plot loss on a grid in 2 directions

• important to think about scale invariance (dinh et al. 2017)

• want to scale each filter of the direction vector to have the same norm as the corresponding filter in the weights

• use PCA to find important directions (ex. sample $$w$$ at each training step, then use pca to find the directions of most variance)
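• a sketch of the filter-wise normalization above (following li et al. 2018's filter normalization; each row of the toy matrix stands in for one flattened conv filter):

```python
import numpy as np

def filter_normalized_direction(weights, seed=0):
    """Random direction rescaled filter-by-filter to match the weights' filter norms."""
    d = np.random.default_rng(seed).normal(size=weights.shape)
    for i in range(weights.shape[0]):               # each row = one "filter"
        d[i] *= np.linalg.norm(weights[i]) / (np.linalg.norm(d[i]) + 1e-12)
    return d

W = np.random.default_rng(1).normal(size=(4, 9))    # 4 toy flattened filters
D = filter_normalized_direction(W)
print([float(np.round(np.linalg.norm(D[i]) / np.linalg.norm(W[i]), 3)) for i in range(4)])
```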

### 1.6.2.8. memorization background

• memorization on single training example

• memorization in dnns

• "memorization" as the behavior exhibited by DNNs trained on noise, with a series of experiments that contrast the learning dynamics of DNNs on real vs. noise data

• look at critical samples - ones for which an adversarial example exists nearby

• networks that generalize well have deep layers that are approximately linear with respect to batches of similar inputs

• networks that memorize their training data are highly non-linear with respect to similar inputs, even in deep layers

• expect that with respect to a single class, deep layers are approximately linear

• example forgetting paper

• secret sharing

• memorization in overparameterized autoencoders

• autoencoders don't learn the identity, but learn a projection onto the span of the training examples = memorization of training examples (see the sketch below)

• sometimes output individual training images, not just project onto space of training images
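• a tiny linear illustration of "projection onto the span of training examples" (the papers concern nonlinear autoencoders; this only shows what the projection claim means):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))             # columns = 5 training examples in 50 dims
U, _, _ = np.linalg.svd(X, full_matrices=False)
P = U @ U.T                              # projection onto the span of the training data

x_new = rng.normal(size=50)
print(np.allclose(P @ X, X))             # True: training examples reproduced exactly
print(np.allclose(P @ (P @ x_new), P @ x_new))   # new inputs get projected onto the span
```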

• https://arxiv.org/pdf/1810.10333.pdf

• https://arxiv.org/pdf/1909.12362.pdf

• https://pdfs.semanticscholar.org/a624/6278cb5e2d0ab79fe20fe20a41c586732a11.pdf

• Auto-encoders: reconstruction versus compression

### 1.6.2.10. architecture search background

• ideas: nested search (retrain each time), joint search (make arch search differentiable), one-shot search (train big net then search for subnet)

## 1.6.3. basics

• good set of class notes

• demos to gain intuition

• colah

• overview / reviews

• some people involved: nathan srebro, sanjeev arora, jascha sohl-dickstein, tomaso poggio, stefano soatto, ben recht, olivier bousquet, jason lee, simon shaolei du

• interpretability: cynthia rudin, rich caruana, been kim, nicholas papernot, finale doshi-velez

• neuro: eero simoncelli, haim sompolinsky