# 5.7. deep learning

## 5.7.1. neural networks

• basic perceptron update rule

• if output is 0 but should be 1: raise weights on active connections by d

• if output is 1 but should be 0: lower weights on active connections by d

• perceptron convergence thm - if the data is linearly separable, the perceptron learning algorithm will converge
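A minimal numpy sketch of this update rule (the learning rate d, 0/1 labels, and threshold activation here are illustrative assumptions):

```python
import numpy as np

def perceptron_train(X, y, d=1.0, epochs=10):
    """Perceptron learning on 0/1 labels; X has one example per row."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            out = int(w @ x_i + b > 0)    # threshold (binary step) output
            if out == 0 and y_i == 1:     # should be 1: raise weights on active connections
                w += d * x_i
                b += d
            elif out == 1 and y_i == 0:   # should be 0: lower weights on active connections
                w -= d * x_i
                b -= d
    return w, b
```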

• transfer / activation functions

• sigmoid(z) = $$\frac{1}{1+e^{-z}}$$

• Binary step

• TanH (generally preferred to sigmoid since it is zero-centered)

• Rectifier = ReLU

• Leaky ReLU - still has a small nonzero slope when z < 0, so the gradient doesn't die there

• rectifying in electronics converts AC -> DC

• rare to mix and match neuron types
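The activation functions above, sketched in numpy for array inputs (the 0.01 leak slope is a common but arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def binary_step(z):
    return (np.asarray(z) > 0).astype(float)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)   # small nonzero slope for z < 0

# tanh is just np.tanh
```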

• deep - more than 1 hidden layer

• regression loss = $$\frac{1}{2}(y-\hat{y})^2$$

• classification loss = $$-y \log (\hat{y}) - (1-y) \log(1-\hat{y})$$

• canât use SSE because not convex here

• multiclass classification loss $$=-\sum_j y_j \ln \hat{y}_j$$
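The three losses above as numpy functions (the eps clipping to avoid log(0) is an implementation detail, not part of the formulas):

```python
import numpy as np

def regression_loss(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def multiclass_cross_entropy(y, y_hat, eps=1e-12):
    # y is a one-hot vector over classes, y_hat a vector of predicted probabilities
    return -np.sum(y * np.log(np.clip(y_hat, eps, None)))
```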

• backpropagation - application of reverse-mode automatic differentiation to a neural network's loss

• apply the chain rule from the end of the program back towards the beginning

• $$\frac{dL}{dx_i} = \frac{dL}{dz} \frac{\partial z}{\partial x_i}$$

• sum $$\frac{dL}{dz}$$ if neuron has multiple outputs z

• L is the final output (the loss)

• $$\frac{\partial z}{\partial x_i}$$ is actually a Jacobian (deriv of each $$z_j$$ wrt each $$x_i$$ - these are vectors)

• each gate usually has some sparsity structure so you don't compute the whole Jacobian

• pipeline

• initialize weights, and final derivative ($$\frac{dL}{dL}=1$$)

• for each batch

• run network forward to compute outputs at each step

• compute gradients at each gate with backprop

• update weights with SGD
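A minimal sketch of this pipeline for a one-hidden-layer sigmoid network with the regression loss above; the layer sizes, learning rate, and synthetic batches are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(4, 3))    # initialize weights
W2 = 0.1 * rng.normal(size=(1, 4))
lr = 0.1
batches = [(rng.normal(size=(3, 8)), rng.normal(size=(1, 8))) for _ in range(100)]

for X, y in batches:                  # for each batch
    # run network forward, keeping intermediate outputs for backprop
    h = sigmoid(W1 @ X)
    y_hat = W2 @ h
    # compute gradients with the chain rule, starting from dL/dy_hat
    dy = y_hat - y                    # gradient of L = 0.5 * (y - y_hat)^2
    dW2 = dy @ h.T
    dh = W2.T @ dy                    # sums dL/dz over each neuron's outputs
    dW1 = (dh * h * (1 - h)) @ X.T    # sigmoid'(z) = h * (1 - h)
    # update weights with SGD
    W2 -= lr * dW2
    W1 -= lr * dW1
```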
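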

## 5.7.2. training

• vanishing gradients problem - gradients are significantly smaller in earlier layers than in later layers

• happens with sigmoids (their gradients saturate toward 0 at both extremes)

• exploding gradients problem - gradients are significantly larger in earlier layers than later layers

• both problems are especially severe in RNNs

• batch normalization - whiten inputs to all neurons (zero mean, variance of 1)

• do this for each input to the next layer
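A sketch of the whitening step in numpy; the learned scale/shift (gamma, beta) that standard batch norm applies after whitening is included for completeness:

```python
import numpy as np

def batch_norm(Z, gamma=1.0, beta=0.0, eps=1e-5):
    """Whiten each input feature over the batch (rows = examples)."""
    Z_hat = (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)  # zero mean, unit variance
    return gamma * Z_hat + beta   # learned rescaling, trained with the network
```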

• dropout - randomly zero the outputs of a fraction 1-p of the neurons during training (each neuron is kept with probability p)

• like learning large ensemble of models that share weights

• 2 ways to compensate (pick one)

1. at test time multiply all neurons' outputs by p

2. during training divide all kept neurons' outputs by p ("inverted dropout")
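A sketch of option 2 (inverted dropout), taking p to be the keep probability:

```python
import numpy as np

def dropout(a, p=0.5, train=True, seed=None):
    """Zero each unit with probability 1 - p; rescale so no test-time change is needed."""
    if not train:
        return a                             # option 2 needs no test-time compensation
    mask = np.random.default_rng(seed).random(a.shape) < p
    return a * mask / p                      # divide by p so the expected output is unchanged
```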

• softmax - takes vector z and returns vector of the same length

• makes it so output sums to 1 (like probabilities of classes)
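A numerically stable numpy sketch (subtracting the max before exponentiating doesn't change the result):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()          # same length as z, entries sum to 1
```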

## 5.7.3. CNNs

• kernel here means filter

• convolution - the output G is a windowed average of an image F with a filter H, where the filter is flipped horizontally and vertically before being applied

• G = H $$\ast$$ F

• if we do a filter with just a 1 in the middle, we get the exact same image

• this identity filter can be padded with zeros to any size, as long as the 1 stays in the middle

• can use these to detect edges with small convolutions

• can do Gaussian filters

• convolutions typically sum over all color channels

• 1x1 conv - still convolves over channels
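A single-channel "valid" convolution sketch matching the definition above (a real conv layer would also sum this over channels); the identity filter shows the 1-in-the-middle property:

```python
import numpy as np

def convolve2d(F, H):
    """G = H * F: flip the filter both ways, then take windowed sums."""
    Hf = H[::-1, ::-1]                              # flip horizontally and vertically
    kh, kw = Hf.shape
    G = np.zeros((F.shape[0] - kh + 1, F.shape[1] - kw + 1))
    for i in range(G.shape[0]):
        for j in range(G.shape[1]):
            G[i, j] = np.sum(F[i:i+kh, j:j+kw] * Hf)
    return G

identity = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]])   # returns the same image (up to border cropping)
```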

• pooling - usually max - doesn't pool over depth

• people are trying to move away from pooling - e.g. by using larger strides in convolutional layers

• stacking small layers is generally better

• most of the memory impact is usually from the activations of each layer, which are kept around for backprop

• visualizations

• layer activations (maybe average over channels)

• visualize the weights (maybe average over channels)

• feed a bunch of images and keep track of which activate a neuron most

• t-SNE embedding of images

• occluding parts of the input and seeing how the prediction changes

• weight matrices have special structure (Toeplitz or block Toeplitz)

• input layer is usually centered (subtract mean over training set)

• usually crop to fixed size (square input)

• receptive field - the region of the input that affects a given neuron

• stride m - compute only every mth pixel

• downsampling

• max pooling - backprop error back to neuron w/ max value

• average pooling - backprop splits error equally among input neurons
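A sketch of both backward rules for a single pooling window (grad_out is the upstream error for that window's output):

```python
import numpy as np

def max_pool_backward(x_window, grad_out):
    """Route the error only to the input neuron that had the max value."""
    grad = np.zeros_like(x_window, dtype=float)
    grad.flat[np.argmax(x_window)] = grad_out
    return grad

def avg_pool_backward(x_window, grad_out):
    """Split the error equally among all inputs in the window."""
    return np.full(x_window.shape, grad_out / x_window.size)
```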

• data augmentation - random rotations, flips, shifts, recolorings

• siamese networks - extract features twice with same net then put layer on top

• ex. find how similar two representations are

• famous cnns

• LeNet (1998)

• the first famous CNN; used on MNIST

• AlexNet (2012)

• landmark (5 conv layers, some pooling/dropout)

• ZFNet (2013)

• fine tuning and deconvnet

• VGGNet (2014)

• 19 layers, all 3x3 conv layers and 2x2 maxpooling

• GoogLeNet (2014)

• lots of parallel elements (called Inception modules)

• Msft ResNet (2015)

• very deep - 152 layers

• connections straight from initial layers to end

• each block only learns the "residual" relative to its identity connection

• Region Based CNNs (R-CNN - 2013, Fast R-CNN - 2015, Faster R-CNN - 2015)

• object detection

• Karpathy Generating image descriptions (2014)

• RNN+CNN

• Spatial transformer networks (2015)

• transformations within the network

• Segnet (2015)

• encoder-decoder network

• Unet (2015)

• Ronneberger - applies to biomedical segmentation

• Pixelnet (2017)

• makes pixel-level predictions for different tasks with the same architecture

• convolutional layers, then 3 FC layers which use outputs from all convolutional layers together

• Squeezenet

• Yolonet

• Wavenet

• Densenet

• NASNET

• Efficientnet (2019)

## 5.7.4. architectural components

• coordconv - break translation equivariance by passing in the i, j coordinates as extra input channels
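A sketch of the coordconv idea for an (H, W, C) image; normalizing the coordinates to [-1, 1] is a common choice, not the only one:

```python
import numpy as np

def add_coord_channels(img):
    """Append i, j coordinate maps as two extra channels: (H, W, C) -> (H, W, C + 2)."""
    h, w = img.shape[:2]
    ii, jj = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    return np.concatenate([img, ii[..., None], jj[..., None]], axis=-1)
```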

• deconvolution = transposed convolution = fractionally-strided convolution - like upsampling

• attention

• attention ovw: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

• vector of importance weights: to predict or infer one element, such as a pixel in an image or a word in a sentence, we use the attention vector to estimate how strongly it is correlated with (or "attends to", as you may have read in many papers) the other elements, then take the sum of their values weighted by the attention vector as the approximation of the target

• attention was born to solve the bottleneck of mapping an arbitrary-length input to a fixed-length encoding and then to an arbitrary-length output
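A sketch of this weighted-sum idea for a single query, using scaled dot-product scores (the scoring function from the transformer paper; other score functions exist):

```python
import numpy as np

def attention(q, K, V):
    """Return a sum of the value vectors V, weighted by how strongly q attends to each key."""
    scores = K @ q / np.sqrt(q.size)        # similarity of the query to each key
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax -> importance weights
    return weights @ V                      # weighted sum approximates the target element
```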

## 5.7.6. RNNs

• feedforward NNs have no memory so we introduce recurrent NNs

• able to have memory

• could theoretically unfold the network and train with backprop

• truncated - limit number of times you unfold

• $$state_{new} = f(state_{old},input_t)$$

• ex. $$h_t = \tanh(W_1 h_{t-1}+W_2 x_t)$$

• train with backpropagation through time (unfold through time)

• truncated backprop through time - only backprop through a window of k time steps

• error gradients vanish exponentially quickly with time lag
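A sketch of unfolding the recurrence above over a sequence (weight shapes are assumptions; the stored states are what backpropagation through time differentiates through):

```python
import numpy as np

def rnn_forward(xs, W1, W2, h0):
    """Unfold h_t = tanh(W1 h_{t-1} + W2 x_t) over a list of input vectors xs."""
    h, states = h0, []
    for x_t in xs:
        h = np.tanh(W1 @ h + W2 @ x_t)   # new state = f(old state, input)
        states.append(h)
    return states                        # kept around for backprop through time
```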

### 5.7.6.1. LSTMs

• have gates for forgetting, input, output

• easy to let hidden state flow through time, unchanged

• a gate is a sigmoid $$\sigma$$ layer followed by pointwise multiplication

• multiply by 0 - let nothing through

• multiply by 1 - let everything through

• forget gate - conditionally discard previously remembered info

• input gate - conditionally remember new info

• output gate - conditionally output a relevant part of memory

• GRUs - similar, but merge the input / forget units into a single update unit
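A sketch of one LSTM step with the three gates above; biases are omitted and each weight matrix acts on the concatenated [h, x], which is one common parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h, c, W_f, W_i, W_o, W_c):
    z = np.concatenate([h, x])
    f = sigmoid(W_f @ z)               # forget gate: 0 discards old memory, 1 keeps it
    i = sigmoid(W_i @ z)               # input gate: how much new info to remember
    o = sigmoid(W_o @ z)               # output gate: which part of memory to expose
    c = f * c + i * np.tanh(W_c @ z)   # gated memory cell flows through time
    h = o * np.tanh(c)
    return h, c
```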

• adversarial attack - designing an input to get the wrong result from the model

• targeted - try to get specific wrong class

• fast gradient sign method - keep adding the sign of the gradient to maximize the loss (limit the amplitude of each pixel's channel to stay imperceptible)
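A sketch of one such gradient step; grad_x is assumed to be the gradient of the loss with respect to the input (computed by backprop), and eps caps the per-channel amplitude:

```python
import numpy as np

def fgsm_step(x, grad_x, eps=0.01):
    """Move the input along the sign of the loss gradient, capped at eps per channel."""
    x_adv = x + eps * np.sign(grad_x)   # small eps keeps the change imperceptible
    return np.clip(x_adv, 0.0, 1.0)     # keep a valid image
```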

## 5.7.8. transformers

• transformers: https://arxiv.org/pdf/1706.03762.pdf

• spatial transformers: https://papers.nips.cc/paper/5854-spatial-transformer-networks.pdf

• http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

## 5.7.9. graph neural networks

• Theoretical Foundations of Graph Neural Networks

• inputs are graphs

• e.g. molecule input to classification

• one big study: "a deep learning approach to antibiotic discovery" - used a GNN to classify whether molecules inhibit bacterial growth, came up with ~100 candidate antibiotics, and was able to test them

• e.g. traffic maps - nodes are intersections

• invariances built into CNNs: translation, plus the fact that neighboring pixels are much more related than distant ones

• simplest setup: no edges, each node $$i$$ has a feature vector $$x_i$$ (really a set not a graph)

• X is a matrix where each row is a feature vector

• note that permuting the rows of X shouldn't change anything

• permutation invariant: $$f(PX) = f(X)$$ for all permutation matrices $$P$$

• e.g. $$\mathbf{P}_{(2,4,1,3)} \mathbf{X}=\left[\begin{array}{llll}0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0\end{array}\right]\left[\begin{array}{lll}- & \mathbf{x}_{1} & - \\ - & \mathbf{x}_{2} & - \\ - & \mathbf{x}_{3} & - \\ - & \mathbf{x}_{4} & -\end{array}\right]=\left[\begin{array}{lll}- & \mathbf{x}_{2} & - \\ - & \mathbf{x}_{4} & - \\ - & \mathbf{x}_{1} & - \\ - & \mathbf{x}_{3} & -\end{array}\right]$$

• ex. Deep Sets model (Zaheer et al. '17): $$f(X) = \phi \left (\sum_i \psi(x_i) \right)$$

• permutation equivariant: $$f(PX) = P f(X)$$ - useful for when we want answers at the node level
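A sketch of the Deep Sets construction with toy phi and psi (both illustrative assumptions); summing over elements is what makes f permutation invariant:

```python
import numpy as np

def deep_sets(X, psi, phi):
    """f(X) = phi(sum_i psi(x_i)); the sum erases the row order of X."""
    return phi(sum(psi(x_i) for x_i in X))

X = np.random.default_rng(0).normal(size=(4, 3))
psi = lambda x: np.maximum(0, x)   # illustrative per-element feature map
phi = lambda s: s.sum()            # illustrative readout
assert np.isclose(deep_sets(X, psi, phi), deep_sets(X[::-1], psi, phi))  # invariance
```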

• graph: augment set of nodes with edges between them (store as an adjacency matrix)

• applying a permutation to A requires operating on both rows and cols: $$PAP^T$$

• permutation invariance: $$f\left(\mathbf{P X}, \mathbf{P A P}^{\top}\right)=f(\mathbf{X}, \mathbf{A})$$

• permutation equivariance: $$f\left(\mathbf{P X}, \mathbf{P A P}^{\top}\right)=\mathbf{P} f(\mathbf{X}, \mathbf{A})$$

• can now write an equivariant function that extracts features not only from each node but also from its neighbors: $$g(x_b, X_{\mathcal N_b})$$

• 3 flavors of GNN layers for extracting features from nodes / neighbors, from simplest to most complex: convolutional, attentional, and message-passing

• message-passing actually passes vectors to be sent across edges
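A sketch of the simplest (convolutional, sum-aggregation) flavor; the attentional and message-passing flavors replace the fixed adjacency weights with learned ones. Weight shapes are assumptions:

```python
import numpy as np

def gnn_layer(X, A, W_self, W_neigh):
    """Each node combines its own features with a sum over its neighbors' features."""
    neighbor_sum = A @ X @ W_neigh   # aggregate features along edges
    # permutation equivariant: f(PX, P A P^T) = P f(X, A)
    return np.maximum(0, X @ W_self + neighbor_sum)
```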

• previous approaches map onto GNNs well

• GNNs explicitly construct local features, much like previous works

• local objectives: features of nodes i and j should predict existence of edge $$(i, j)$$

• random-walk objectives: features should be similar if i and j co-occur on a short random walk (e.g. deepwalk, node2vec, line)

• similarities to NLP if we think of words as nodes and sentences as walks

• we can think of transformers as fully-connected graph networks with an attentional form of GNN layer

• one big difference: positional embeddings often used, making the input not clearly a graph

• these positional embeddings often take the form of sin/cos - very similar to the DFT eigenvectors of a graph

• spectral gnns

• operate on graph laplacian matrix $$L = D - A$$ where $$D$$ is degree matrix and $$A$$ is adjacency matrix - more mathematically convenient
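The Laplacian in numpy, directly from that definition:

```python
import numpy as np

def graph_laplacian(A):
    """L = D - A, where D is the diagonal matrix of node degrees."""
    return np.diag(A.sum(axis=1)) - A
```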

• probabilistic modeling - e.g. assume a Markov random field and try to learn its parameters

• this connects well to a message-passing GNN

• GNN limitations

• ex. can we tell whether 2 graphs are isomorphic? often no

• can make GNNs more powerful by adding positional features, etc.

• can also embed subgraphs together

• things are more difficult in the continuous case...

• geometric deep learning: invariances and equivariances can be applied generally to get a large class of architectures spanning convolutions and graphs

## 5.7.10. misc

• adaptive pooling can help deal with different sizes