Chandan Singh | dl theory

dl theory

Deep learning theory is a complex emerging field - this post contains links displaying some different interesting research directions

off convex blog

# theoretical studies

DNNs display many surprising properties

Many things seem to contribute to the inductive bias of DNNs: SGD, dropout, early stopping, resnets, convolution, more layers…all of these are tangled together and many things correlate with generalization error…what are the important things and how do they contribute?

some more concrete questions:

• what is happening when training err stops going down but val err keeps going down (interpolation regime)?
• what are good statistical markers of an effectively trained DNN?
• how far apart are 2 nets?

## functional approximation

• dnns are very hard to study in the parameter space (e.g. swapping two parameters changes things like the Hessian), easier to to study in the function space (e.g. the input-output relationship)
• nonlinear approximation (e.g. sparse coding) - 2 steps
1. construct a dictionary function (T)
1. learn linear combination of the dictionary elements (g)
• background
•  $L^2$ function (or function space) is square integrable: $f ^2 = \int_X f ^2 d\mu$, and $f$ is its $L_2$-norm
• Hilbert space - vector space w/ additional structure of inner product which allows length + angle to be measured
• complete - there are enough limits in the space to allow calculus techniques (is a complete metric space)
• composition allows an approximation of a function through level sets (split it up and approximate on these sets) - Zuowei Shen (talk, slides)
• composition operation allows an approximation of a function f through level sets of f –
• one divides up the range of f into equal intervals and approximate the functions on these sets
• nonlinear approximation via compositions (shen 2019)
• how do the weights in each layer help this approximation to be more effective?
• here are some thoughts – If other layers are like the first layer, the weights “whiten” or make the inputs more independent or random projections – that is basically finding PC directions for low-rank inputs.
• are the outputs from later layers more or less low - rank?
• I wonder how this “whitening” helps level set estimation…
• takagi functions
• nonlinear approximation and (deep) relu nets - also comes with slides

## inductive bias=implicit regularization: gradient descent finds good minima

DL learns solutions that generalize even though it can find many which don’t due to its inductive bias.

• gd bousquet paper
• in high dims, local minima are usually saddles (ganguli)
• early stopping is very similar to ridge regression
• ridge regression: soln will lie in $p-dim$ row-space of X (span of the rows)
• when $\lambda \to 0$, will give us min-norm soln (basically because we project onto $col(X)$)
• early stopping in least squares
• if we initialize at 0, GD soln will always be in the row-space of X
• GD will converge to min-norm soln (any soln not in the row-space will necessarily have larger norm)
• similar to ridge (endpoints are the same) and can related their risks very closely when $\lambda = 1/t$, where $t$ is GD time iterate
• assume gradient flow: take step-size to 0
• srebro understanding over-parameterization
• ex. gunasekar et al 2017: unconstrained matrix completion
• grad descent on U, V yields min nuclear norm solution
• ex. soudry et al 2017
• sgd on logistic reg. gives hard margin svm
• deep linear net gives the same thing - doesn’t actually changed anything
• linear convnets give smth better - minimum l1 norm in discrete fourier transform
• ex. savarese 2019
• infinite width relu net 1-d input
• weight decay minimization minimizes derivative of TV
• implicit bias towards simpler models
• How do infinite width bounded norm networks look in function space? (savarese…srebro 2019)
• minimal norm fit for a sample is given by a linear spline interpolation (2 layer net)
• analytic theory of generalization + transfer (ganguli 19)
• datasets for measuring causality
• Inferring Hidden Statuses and Actions in Video by Causal Reasoning - about finding causality in the video, not interpretation
• PL condition = Polyak-Lojawsiewicz condition guarantees global convergence of loca methods
•  $\nabla f(x) ^2 \geq \alpha f(x) \geq 0$

## kernels

• To understand deep learning we need to understand kernel learning - overfitted kernel classifiers can still fit the data well
• original kernels (neal 1994) + (lee et al. 2018) + (matthews et al. 2018)
• infinitely wide nets and only top layer is trained
• corresponds to kernel $\text{ker}(x, x’) = \mathbb E_{\theta \sim W}[f(\theta, x) \cdot f(\theta, x’)]$, where $W$ is an intialization distr. over $\theta$
• neural tangent kernel (jacot et al. 2018)
• $\text{ker}(x, x’) = \mathbb E_{\theta \sim W}[\left < \frac{f(\theta, x)}{\partial \theta} \cdot \frac{f(\theta, x’)}{\partial \theta} \right> ]$ - evolution of weights over time follows this kernel
• with very large width, this kernel is the NTK at initialization
• stays stable during training (since weights don’t change much)
• at initialization, artificial neural networks (ANNs) are equivalent to Gaussian processes in the infinite-width limit
• evolution of an ANN during training can also be described by a kernel (kernel gradient descent)
• different types of kernels impose different things on a function (e.g. want more / less low frequencies)
• gradient descent in kernel space can be convex if kernel is PD (even if nonconvex in the parameter space)
• understanding the neural tangent kernel (arora et al. 2019)
• method to compute the kernel quickly on a gpu
• Scaling description of generalization with number of parameters in deep learning (geiger et al. 2019)
• number of params = N
• above 0 training err, larger number of params reduces variance but doesn’t actually help
• ensembling with smaller N fixes problem
• the improvement of generalization performance with N in this classification task originates from reduced variance of fN when N gets large, as recently observed for mean-square regression
• On the Inductive Bias of Neural Tangent Kernels (bietti & mairal 2019)
• Kernel and Deep Regimes in Overparametrized Models (Woodworth…Srebro 2019)
• transition between kernel and deep regimes
• The HSIC Bottleneck: Deep Learning without Back-Propagation (Ma et al. 2019)
• directly optimize information bottleneck (approximated by HSIC) yields pretty good results

## implicit dl + optimization

• implicit deep learning (el ghaoui et al. 2019)
• $\hat y (u) = Cx + D u$, where $x = \phi(Ax + Bu)$
• here, $u$ is a new input
• $x$ is a hidden state which represents some hidden features (which depends on $u$)
• well-posedness - want x to be unique for a given u
• $A, B$ are matrices which let us compute $x$ given $u$
• $C, D$ help us do the final prediction (like the final linear layer)
• ex. feedforward nets
• consider net with $L$ layers
• $x_0 = u$
• $x_{l + 1} = \phi_l (W_l x_l)$
• $\hat y (u) = W_L x_L$
• rewriting in implicit form
• $x = (x_L, …, x_1)$ - concatenate all the activations into one big vector
• ex. $Ax + Bu= \begin{bmatrix} W_{L-1}x_{L-1} \ W_{L-2} x_{L-2} \ \vdots \ W_1x_1 \ \mathbf 0\end{bmatrix} + \begin{bmatrix} 0 \ 0 \ \vdots \ 0 \ W_0 u \end{bmatrix}$
• lifted neural networks (askari et al. 2018)
• can solve dnn $\hat y = \phi(W_2 \phi (W_1X_0))$ by rewriting using constraints:
• $X_1 = \phi(W_1 X_0)$
• $X_2 = \phi(W_2 X_1)$
• \begin{align} &\min (y - \hat y)^2\s.t. X_1 &= \phi(WX_0)\X_2 &= \phi(WX_1)\end{align}
• can be written using Lagrangian multipliers: $\min (y - \hat y)^2 + \lambda_1( X_1 - \phi(WX_0)) + \lambda_2(X_2 - \phi(WX_1))$
• Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training (gu et al. 2018)
• in the lifted setting above, can replace Lagrangian with simpler expression using Fenchel conjugates
• robust optimization basics
• immunize optimization problems against uncertainty in the data
• do so by having worst-case constraints (e.g. $a < 5, \forall a$)
• local robustness - maximize radius surrounding parameter subject to all constraints (no objective to maximize)
• global robustness - maximize objective subject to robustness constraint (trades off robustness with objective value)
• non-probabilistic robust optimization models (e.g. Wald’s maximin model: $\underset{x}{\max} \underset{u}{\min} f(x, u)$
• also are probabilistically robust optimization
• distributional robustness - using moments in the dl work
• ch 4 and ch10 of robust optimization book (bental, el ghaoui, & nemirovski 2009)
• Certifying Some Distributional Robustness with Principled Adversarial Training (sinha, namkoong, & duchi 2018)
• want to guarantee performance under adversarial input perturbations
• considering a Lagrangian penalty formulation of perturbing the underlying data distribution in a Wasserstein ball
• during training, augments model parameter updates with worst-case perturbations of training data
• little extra cost and achieves guarantees for smooth losses
• On Distributionally Robust Chance-Constrained Linear Programs (calafiore & el ghaoui 2006)
• linear programs where data (in the constraints) is random
• want to enforce the constraints up to a given prob. level
• can convert the prob. constraints into convex 2nd-order cone constraints
• under distrs. for the random data, can guarantee constraints
• Differentiable Convex Optimization Layers (agrawal et al. 2019)

## statistical physics

• Statistical Mechanics of Deep Learning (bahri et al. 2019)
• what is the advantage of depth - connect to dynamical phase transitions
• there are several function which require only polynomial nodes in each layer for deep nets, but exponential for shallow nets
• what is the shape of the loss landscape - connect to random Gaussian processes, sping glasses, and jamming
• how to pick a good parameter initialization?
• bounding generalization error
• often, generalization bounds take the form $\epsilon_{test} \leq \epsilon_{train} + \frac {\mathcal C (\mathcal F)} p$, where $\mathcal C (\mathcal F)$ is the complexity of a function class and $p$ is the number of examples
• alternative framework: algorithmic stability - will generalize if map is stable wrt perturbations of the data $\mathcal D$
• altenative: PAC bounds suggest if distr. of weights doesn’t change much during training, generalization will be succesful
• deep linear networks
• student learns biggest singular values of the input-output correlation matrix $\Sigma = \sum_i y_i x_i^T$, so it learns the important stuff first and the noise last
• infinite-width limit
• if parameters are random, indcues a prior distribution $P(\mathcal F)$ over the space of functions
• in th limit of infinite width, this prior is Gaussian, with a specific correlation kernel
•  learning is similar to learning the Bayesian posterior $P(f data)$, but connecting this to sgd is still not clear
• Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics (kunin et al. 2020)
• instead, assumptions on symmetries embedded in a network’s architecture constrain training dynamics
• similar to Noether’s thm in physics
• can much better analytically describe learning dynamics
• weights have a differentiable symmetry in the loss if the loss doesn’t change under a certain differentiable transformation of the weights
• ex. translation symmetry - for layer before softmax, we get symmetries across weights that shift all outputs
• ex. scale symmetry - inputs to batch normalization are invariant to scaling
• ex. rescale symmetry - scaling up/down 2 things which multiply or add
• modeling discretization

# empirical studies

## interesting empirical papers

• robustness may be at odds with accuracy (madry 2019)
• adversarial training helps w/ little data but hurts with lots of data
• adversarially trained models have more meaningful gradients (and their adversarial examples actually look like other classes)
• Generalizability vs. Robustness: Adversarial Examples for Medical Imaging
• Towards Robust Interpretability with Self-Explaining Neural Networks
• Adversarial Attacks and Defenses in Images, Graphs and Text: A Review (xu et al. 2019)
• Adversarial examples are inputs to machine learning models that an attacker intentionally designed to cause the model to make mistakes
• threat models
• poisoning attack (insert fake samples into training data) vs. evasion attack (just evade at test time)
• targeted attack (want specific class) vs. non-targeted attack (just change the prediction)
• white-box - adversary knows everything
• black-box - can only feed inputs and get outputs
• gray-box - might have white box for limited amount of time
• security evaluation
• robustness - minimum norm perturbation to change class
• adversarial loss - biggest change in loss within some epsilon ball
• AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty (hendrycks et al. 2020)
• do a bunch of transformations and average images to create each training image
• certifying robustness
• bound gradients of f around x - solve SDP for 2-layer net (raghunathan, steinhardt, & liang 2018, iclr)
• relax SDP and then applies to multilayer nets (raghunathan et al. 2018)
• improved optimizer for the SDP (dathathri et al. 2020)
• interval bound propagation - instead of passing point, pass an interval where it could be
• works well for word substitions (jia et al. 2019)
• RobEn - cluster words that are confusable + give them the same encoding when you train (jones et al. 2020)
• then you are robust to confusing the words
• unlabeled data + self-training helps
• training robust models has higher sample complexity than training standard models
• tradeoff between robustness and accuracy (raghunathann et al. 2020)
• adding valid data can hurt even in well-specified, convex setting
• robust self-training eliminates this tradeoff in linear regression
• sample complexity can be reduced with unlabeled examples (carmon et al. 2019)
• distributional robust optimization
• instead of minimizing training err, minimize maximum training err over different perturbations
• hard to pick the perturbation set - can easily be too pessimistic
• these things possibly magnify disparities
• larger models
• selective classification
• feature noise
• removing spurious features
• group DRO (sagawa et al. 2020) - maximize error for worst group
• need to add regularization to keep errors from going to 0
• training overparameterized models makes this problem worse
• abstain from classifying based on confidence (jones et al. 2020)
• makes error rates worse for worst group = selective classification can magnify disparities
• adding feature noise to each group can magnify disparities
• standard domain adaptation: labeled (x, y) in source and unlabeled x in target
• gradual domain adaptation - things change slowly over time, can use gradual self-training (kumar et al. 2020)
• In-N-Out (xie et al. 2020) - if we have many features, rather than using them all as features, can use some as features and some as targets when we shift, to learn the domain shift

## tools for analyzing

• dim reduction: svcca, diffusion maps
• viz tools: bei wang
• visualizing loss landscape
• 1d: plot loss by extrapolating between 2 points (start/end, 2 ends)
• goodfellow et al. 2015, im et al. 2016
• exploring landscape with this technique
• 2d: plot loss on a grid in 2 directions
• important to think about scale invariance (dinh et al. 2017)
• want to scale direction vector to have same norm in each direction as filter
• use PCA to find important directions (ex. sample w at each step, pca to find most important directions of variance)

## memorization background

• memorization on single training example
• memorization in dnns
• “memorization” as the behavior exhibited by DNNs trained on noise, and conduct a series of experiments that contrast the learning dynamics of DNNs on real vs. noise data
• look at critical samples - adversarial exists nearby
• networks that generalize well have deep layers that are approximately linear with respect to batches of similar inputs
• networks that memorize their training data are highly non-linear with respect to similar inputs, even in deep layers
• expect that with respect to a single class, deep layers are approximately linear
• example forgetting paper
• secret sharing
• memorization in overparameterized autoencoders
• autoencoders don’t lean identity, but learn projection onto span of training examples = memorization of training examples
• sometimes output individual training images, not just project onto space of training images
• https://arxiv.org/pdf/1810.10333.pdf
• https://arxiv.org/pdf/1909.12362.pdf
• https://pdfs.semanticscholar.org/a624/6278cb5e2d0ab79fe20fe20a41c586732a11.pdf
• Auto-encoders: reconstruction versus compression

## architecture search background

• ideas: nested search (retrain each time), joint search (make arch search differentiable), one-shot search (train big net then search for subnet)

# basics

• good set of class notes
• demos to gain intuition
• colah
• overview / reviews

• some people involved: nathan srebro, sanjeev arora, jascha sohl-dickstein, tomaso poggio, stefano soatto, ben recht, olivier bousquet, jason lee, simon shaolei du
• interpretability: cynthia rudin, rich caruana, been kim, nicholas papernot, finale doshi-velez
• neuro: eero simoncelli, haim sompolinsky