comp neuro
Diverse notes on various topics in computational neuro, data-driven neuro, and neuro-inspired AI.
introduction
overview
 lacking: insight from neuro that can help build machines
 scales: cortex, column, neuron, synapses
 physics: theory and practice are much closer
 are there principles?
 “god is a hacker”  francis crick
 theorists are lazy  ramon y cajal
 things seemed like mush but became more clear  horace barlow
history

ai
 people: turing, von neumann, marvin minsky, mccarthy…
 ai: birth at 1956 conference
 vision: marvin minsky thought it would be a summer project
 lighthill debate 1973  was ai worth funding?
 intelligence tends to be developed by young children
 cortex grew very rapidly

cybernetics / artificial neural nets

people: norbert wiener, mcculloch & pitts, rosenblatt

neuro
 hubel & wiesel (1962, 1965) simple, complex, hypercomplex cells
 neocognitron fukushima (1980)
 david marr: theory, representation, implementation
 felleman & van essen (1991)
 ascending layers (e.g. v1 → v2): goes from superficial to deep layers
 descending layers (e.g. v2 → v1): deep layers to superficial
 solari & stoner (2011) “cognitive consilience” - layer thicknesses change in different parts of the brain
 motor cortex has much smaller input (layer 4), since it is mostly output

types of models
 three types
 descriptive brain model  encode / decode external stimuli
 mechanistic brain cell / network model - simulate the behavior of a single neuron / network
 interpretive (or normative) brain model  why do brain circuits operate how they do

receptive field  the things that make a neuron fire
 retina has on-center / off-surround cells - stimulated by points
 then, V1 has differently shaped receptive fields

efficient coding hypothesis  brain learns different combinations (e.g. lines) that can efficiently represent images
 sparse coding (Olshausen and Field, 1996)
 ICA (Bell and Sejnowski, 1997)
 predictive coding (Rao and Ballard, 1999)
 brain is trying to learn faithful and efficient representations of an animal’s natural environment
biophysical models
modeling neurons

membrane can be treated as a simple circuit, with a capacitor and resistor
 nernst battery
 osmosis (for each ion)
 electrostatic forces (for each ion)
 together these yield the Nernst potential $E = \frac{k_B T}{zq} \ln \frac{[out]}{[in]}$ (numeric sketch at the end of this list)
 T is temp
 q is ionic charge
 z is num charges
 part of voltage is accounted for by nernst battery $V_{rest}$
 yields $\tau \frac{dV}{dt} = -V + V_\infty$ where $\tau=R_mC_m=r_mc_m$
 equivalently, $\tau_m \frac{dV}{dt} = -(V-E_L) - g_s(t)(V-E_s) r_m + I_e R_m$
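A minimal numeric check of the Nernst formula above (the K+ concentrations are illustrative assumptions, not from the notes):

```python
import numpy as np

k_B = 1.381e-23   # Boltzmann constant (J/K)
q = 1.602e-19     # elementary charge (C)

def nernst(c_out, c_in, z=1, T=310.0):
    """Nernst potential E = (k_B T / z q) * ln([out]/[in]), in volts."""
    return k_B * T / (z * q) * np.log(c_out / c_in)

# illustrative K+ concentrations (mM): ~5 outside, ~140 inside
print(nernst(5.0, 140.0) * 1e3, "mV")   # ~ -89 mV
```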
simplified model neurons
 integrate-and-fire neuron (sketch after this list)
 passive membrane (neuron charges)
 when $V = V_{thresh}$, a spike is fired
 then $V = V_{reset}$
 approximation is poor near threshold
 can include threshold by saying
 when $V = V_{max}$, a spike is fired
 then $V = V_{reset}$
 modeling multiple variables
 also model a K current
 can capture things like resonance
 theta neuron (Ermentrout and Kopell)
 often used for periodically firing neurons (it fires spontaneously)
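A minimal leaky integrate-and-fire sketch of the model above (all parameter values are illustrative assumptions):

```python
import numpy as np

def simulate_lif(I_e, dt=1e-4, T=0.5, tau=20e-3, R=10e6,
                 V_rest=-70e-3, V_thresh=-54e-3, V_reset=-80e-3):
    """Integrate tau dV/dt = -(V - V_rest) + R*I_e; spike and reset at threshold."""
    n = int(T / dt)
    V = np.full(n, V_rest)
    spike_times = []
    for i in range(1, n):
        dV = (-(V[i-1] - V_rest) + R * I_e) / tau
        V[i] = V[i-1] + dt * dV
        if V[i] >= V_thresh:          # fire a spike, then reset
            spike_times.append(i * dt)
            V[i] = V_reset
    return V, spike_times

V, spikes = simulate_lif(I_e=2e-9)    # constant 2 nA input current
print(len(spikes), "spikes in 0.5 s")
```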
modeling dendrites / axons
 cable theory  Kelvin
 voltage V is a function of both x and t
 separate into sections that don’t depend on x
 coupling conductances link the sections (based on area of compartments / branching)
 Rall model for dendrites
 if branches obey a certain branching ratio, can replace each pair of branches with a single cable segment with equivalent surface area and electrotonic length
 $d_1^{3/2} = d_{11}^{3/2} + d_{12}^{3/2}$
 dendritic computation (London and Hausser 2005)
 hippocampus - when inputs arrive at the soma, they have a similar shape no matter where they come in = synaptic scaling
 where inputs enter influences how they sum
 dendrites can generate spikes (usually calcium) / backpropagating spikes
 ex. Jeffress model  sound localized based on timing difference between ears
 ex. direction selectivity in retinal ganglion cells - if events arrive at the dendrite far → close, they all get to the soma at the same time and add
circuitmodeling basics
 membrane has capacitance $C_m$
 force for diffusion, force for drift
 can write down diffeq for this, which yields an equilibrium
 $\tau = RC$
 bigger $\tau$ is slower
 to increase capacitance
 could have larger diameter $C_m \propto D$
 axial resistance $R_A \propto 1/D^2$ (not same as membrane leak), thus bigger axons actually charge faster
action potentials
 channel/receptor types
 ionotropic: $G_{ion}$ = f(molecules outside)
 something binds and opens channel
 metabotropic: $G_{ion}$ = f(molecules inside)
 doesn’t directly open a channel: indirect
 others
 photoreceptor
 hair cell
 voltage-gated (active - provide gain; might not require active ATP, other channels are all passive)
physics of computation
 drift and diffusion are at the heart of everything (based on carver mead)
 Boltzmann distr. models many things (ex. distr of air molecules vs elevation. Subject to gravity and diffusion upwards since they’re colliding)
 nernst potential
 current-voltage relation of voltage-gated channels
 current-voltage relation of MOS transistor
 these things are all like a transistor: energy barrier that must be overcome
spiking neurons
 passive membrane model was leaky integrator
 voltage-gated channels were more complicated
 can be thought of as a leaky integrate-and-fire neuron (LIF)
 this charges up and then fires a spike, has refractory period, then starts charging up again
 rate coding hypothesis  signal conveyed is the rate of spiking (some folks think is too simple)
 spiking irregularly is largely due to noise and doesn’t convey information
 some neurons (e.g. neurons in LIP) might actually just convey a rate
 linearnonlinearpoisson model (LNP)  sometimes called GLM (generalized linear model)
 based on observation that variance in firing rate $\propto$ mean firing rate
 plotting variance vs mean: slope of 1 $\implies$ Poisson output
 these led people to model firing rates as Poisson $\frac {\lambda^n e^{-\lambda}} {n!}$
 bruno doesn’t really believe the firing is random (just an effect of other things we can’t measure)
 ex. fly H1 neuron 1997
 constant stimulus looks very Poisson
 moving stimulus looks very Bernoulli
 spike timing hypothesis
 spike timing can be very precise in response to timevarying signals (mainen & sejnowski 1995; bair & koch 1996)
 often see precise timing
 encoding: stimulus $\to$ spikes
 decoding: spikes $\to$ representation
 encoding + decoding are related through the joint distr. over stimulus and response (see Bialek Spikes book)
 nonlinear encoding function can yield linear decoding
 able to directly decode spikes using a kernel to reproduce signal (seems to say you need spikes  rates would not be good enough)
 some reactions happen too fast to average spikes (e.g. 30 ms)
 estimating information rate: bits (usually better than SNR - can convert between them) - usually 2-3 bits/spike
neural coding
neural encoding
defining neural code
 encoding: P(response | stimulus)
 tuning curve  neuron’s response (ex. firing rate) as a function of stimulus
 orientation / color selective cells are distributed in organized fashion
 some neurons fire to a concept, like “Pamela Anderson”
 retina (simple) → V1 (orientations) → V4 (combinations) → ?
 also massive feedback
 decoding: P(stimulus | response)
simple encoding
 want P(response | stimulus)
 response := firing rate r(t)
 stimulus := s

simple linear model
 r(t) = c * s(t)

weighted linear model  takes into account previous states weighted by f
 temporal filtering
 r(t) = $f_0 \cdot s_0 + … + f_t \cdot s_t = \sum_k f_k s_{t-k}$ where f weights the stimulus over time
 could also make this an integral, yielding a convolution:
 r(t) = $\int_{-\infty}^t d\tau \, s(t-\tau) f(\tau)$
 a linear system can be thought of as a system that searches for portions of the signal that resemble its filter f
 leaky integrator  sums its inputs with f decaying exponentially into the past
 flaws
 no negative firing rates
 no extremely high firing rates
 adding a nonlinear function g of the linear sum can fix this (sketch after this list)
 r(t) = $g\left(\int_{-\infty}^t d\tau \, s(t-\tau) f(\tau)\right)$
 spatial filtering
 r(x,y) = $\sum_{x',y'} f_{x',y'} \, s_{x-x',y-y'}$ where f again is the spatial weighting that represents the receptive field
 could also write this as a convolution
 for a retinal center surround cell, f is positive for small $\Delta x$ and then negative for large $\Delta x$
 can be calculated as a narrow, large positive Gaussian + a spread-out negative Gaussian (difference of Gaussians)
 can combine the above to make spatiotemporal filtering
 filtering = convolution = projection
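A small sketch of the linear-nonlinear idea above: convolve a white-noise stimulus with an exponential (leaky-integrator) filter, then pass it through a rectifying nonlinearity g (filter shape and gain are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.01
t_filt = np.arange(0, 0.3, dt)
f = np.exp(-t_filt / 0.05)                 # leaky-integrator filter, decaying into the past

s = rng.normal(size=2000)                  # white-noise stimulus s(t)
lin = np.convolve(s, f, mode="full")[:len(s)] * dt   # r_lin(t) = sum_k f_k s_{t-k}

def g(x):                                  # output nonlinearity: no negative firing rates
    return np.maximum(0.0, 50.0 * x)       # 50 Hz gain is an arbitrary choice

rate = g(lin)
print(rate.min(), rate.mean())
```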
feature selection
 P(response|stimulus) is very hard to get
 stimulus can be highdimensional (e.g. video)
 stimulus can take on many values
 need to keep track of stimulus over time
 solution: sample responses to many stimuli to characterize what in the input triggers responses
 find vector f that captures features that lead to spike
 dimensionality reduction  ex. discretize
 value at each time $t_i$ is new dimension
 commonly use Gaussian white noise
 time step sets cutoff of highest frequency present
 prior distribution  distribution of stimulus
 multivariate Gaussian  Gaussian in any dimension, or any linear combination of dimensions
 look at where the spike-triggering stimulus points are and calculate the spike-triggered average f of the features that led to a spike
 use this f as the filter (sketch after this list)
 determining the nonlinear input/output function g
 replace stimulus in P(spike|stimulus) with P(spike|$s_1$), where $s_1$ is our filtered stimulus
 use bayes rule $g=P(spike|s_1)=\frac{P(s_1|spike)P(spike)}{P(s_1)}$
 if $P(s_1|spike) \approx P(s_1)$ then the response doesn’t seem to have much to do with the stimulus
 incorporating many features $f_1,…,f_n$
 here, each $f_i$ is a vector of weights
 $r(t) = g(f_1\cdot s,f_2 \cdot s,…,f_n \cdot s)$
 could use PCA - discovers low-dimensional structure in high-dimensional data
 each f represents a feature (maybe a curve over time) that fires the neuron
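A sketch of the spike-triggered-average recipe above, with fake spikes generated from a hypothetical filter (everything here is synthetic, just to show the computation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lag = 20_000, 20
stimulus = rng.normal(size=n)                     # Gaussian white-noise stimulus

true_f = np.exp(-np.arange(lag) / 5.0)            # hypothetical filter used only to fake spikes
drive = np.convolve(stimulus, true_f, mode="full")[:n]
spikes = rng.random(n) < 1 / (1 + np.exp(-(drive - 2)))   # toy nonlinearity -> spike probability

spike_times = np.flatnonzero(spikes)
spike_times = spike_times[spike_times >= lag]
# STA = mean stimulus segment preceding each spike; should resemble the (time-reversed) filter
sta = np.mean([stimulus[t - lag + 1 : t + 1] for t in spike_times], axis=0)
print(sta.shape)
```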
variability

hidden assumptions about time-varying firing rate and single spikes
 a smooth rate function r(t) can miss some stimuli

statistics of the stimulus can affect P(spike|stimulus)
 Gaussian white noise is nice because no way to filter it to get structure
 identifying good filter
 want $P(s_f|spike)$ to differ from $P(s_f)$ where $s_f$ is calculated via the filter
 instead of PCA, could look for f that directly maximizes this difference (Sharpee & Bialek, 2004)
 KullbackLeibler divergence  calculates difference between 2 distributions
 $D_{KL}(P(s) \| Q(s)) = \int ds \, P(s) \log_2 \frac{P(s)}{Q(s)}$
 maximizing KL divergence is equivalent to maximizing mutual info between spike and stimulus
 this is because we are looking for most informative feature
 this technique doesn’t require that our stimulus is white noise, so can use natural stimuli
 maximization isn’t guaranteed to uniquely converge
 modeling the noise

need to go from r(t) → spike times

divide time T into n bins with p = probability of firing per bin

over some chunk T, number of spikes follows binomial distribution (n, p)

if n gets very large, binomial approximates Poisson
 $\lambda$ = spikes in some set time (mean = $\lambda$, var = $\lambda$)
 can test if distr is Poisson with Fano factor = variance/mean = 1 (sketch below)
 interspike intervals have exponential distribution  if fires a lot, this can be bad assumption (due to refractory period)
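A quick synthetic check of the Poisson assumptions above (rate and bin size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
rate, dt, T = 20.0, 1e-3, 1000.0               # 20 Hz, 1 ms bins, 1000 s
spikes = rng.random(int(T / dt)) < rate * dt   # Bernoulli bins ~ Poisson for small p

counts = spikes.reshape(-1, 1000).sum(axis=1)  # spike counts in 1 s windows
print(counts.mean(), counts.var() / counts.mean())   # mean ~20, Fano factor ~1

isis = np.diff(np.flatnonzero(spikes)) * dt    # inter-spike intervals ~ exponential
print(isis.mean())                             # ~ 1/rate = 50 ms
```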


generalized linear model adds explicit spikegeneration / postspike filter (Pillow et al. 2008)
 $P(\text{spike at }t)\sim\exp(f_1 \cdot s + h_1 \cdot r)$
 postspike filter models refractory period
 Paninski showed that using exponential nonlinearity allows this to be optimized
 could add in firing of other neurons
 timerescaling theorem  tests how well we have captured influences on spiking (Brown et al 2001)
 scaled ISIs (each interval $t_i - t_{i-1}$ rescaled by r(t)) should be exponentially distributed
neural decoding
neural decoding and signal detection
 decoding: P(stimulus | response) - ex. you hear a noise and want to tell what it is
 here $r$ = response = firing rate
 monkey is trained to move eyes in same direction as dot pattern (Britten et al. 92)
 when dots all move in same direction (100% coherence), easy
 neuron recorded in MT  tracks dots
 count firing rate when monkey tracks in right direction
 count firing rate when monkey tracks in wrong direction
 as coherence decreases, these firing rates blur
 need to get P(+ or - | r)
 can set a threshold on r by maximizing likelihood
 P(r|+) and P(r|-) are likelihoods
 Neyman-Pearson lemma - likelihood ratio test is the most efficient statistic, in that it has the most power for a given size
 $\frac{p(r|+)}{p(r|-)} > 1?$
 accumulated evidence  we can accumulate evidence over time by multiplying these probabilities
 instead we sum the logs and compare to 0 (sketch after this list)
 $\sum_i \ln \frac{p(r_i|+)}{p(r_i|-)} > 0?$
 once we hit some threshold for this sum, we can make a decision + or 
 experimental evidence (Kiani, Hanks, & Shadlen, Nat. Neurosci 2006)
 monkey is making decision about whether dots are moving left/right
 neuron firing rates increase over time, representing integrated evidence
 neuron always seems to stop at same firing rate
 priors - ex. tiger is much less likely than breeze
 scale P(+|r) by prior P(+)
 neuroscience ex. photoreceptor cells: P(noise|r) is much larger than P(signal|r)
 therefore threshold on r is high to minimize total mistakes
 cost of acting/not acting
 loss for predicting + when it is -: $L_+ \cdot P[-|r]$
 loss for predicting - when it is +: $L_- \cdot P[+|r]$
 cut your losses: answer + when average Loss$_+$ < Loss$_-$
 i.e. $L_+ \cdot P[-|r]$ < $L_- \cdot P[+|r]$
 rewriting with Bayes’ rule yields new test:
 $\frac{p(r|+)}{p(r|-)}> \frac{L_+ \cdot P[-]}{L_- \cdot P[+]}$
 here the loss/prior term replaces the 1 in the Neyman-Pearson test
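A toy version of the likelihood-ratio / accumulated-evidence test above, assuming Gaussian firing-rate distributions for the two conditions (means and variance are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_plus, mu_minus, sigma = 30.0, 20.0, 8.0     # assumed P(r|+) and P(r|-)

def log_gauss(r, mu, sigma):
    return -0.5 * ((r - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

r = rng.normal(mu_plus, sigma, size=50)        # observed rates; the stimulus really was "+"
llr = np.cumsum(log_gauss(r, mu_plus, sigma) - log_gauss(r, mu_minus, sigma))

print("+" if llr[-1] > 0 else "-", llr[-1])    # accumulated log-likelihood ratio vs 0
```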
population coding and bayesian estimation
 population vector  sums vectors for cells that point in different directions weighted by their firing rates
 ex. cricket cercal cells sense wind in different directions
 since neuron can’t have negative firing rate, need overcomplete basis so that can record wind in both directions along an axis
 can do the same thing for direction of arm movement in a neural prosthesis
 not general  some neurons aren’t tuned, are noisier
 not optimal - doesn’t make use of all the information in the stimulus/response distributions
 bayesian inference
 $p(s|r) = \frac{p(r|s)p(s)}{p(r)}$
 maximum likelihood: s* which maximizes p(r|s)
 MAP = maximum a posteriori: s* which maximizes p(s|r)
 simple continuous stimulus example
 setup
 s  orientation of an edge
 each neuron’s average firing rate=tuning curve $f_a(s)$ is Gaussian (in s)
 let $r_a$ be number of spikes for neuron a
 assume receptive fields of neurons span s: $\sum_a f_a(s)$ is constant
 solving
 maximizing loglikelihood with respect to s 
 take derivative and set to 0
 soln $s^* = \frac{\sum r_a s_a / \sigma_a^2}{\sum r_a / \sigma_a^2}$
 if all the $\sigma_a$ are the same, $s^* = \frac{\sum r_a s_a}{\sum r_a}$ (sketch after this list)
 this is the population vector
 maximum a posteriori
 $\ln p(s|r) = \ln P(r|s) + \ln p(s) - \ln P(r)$
 $s^* = \frac{T \sum_a r_a s_a / \sigma_a^2 + s_{prior} / \sigma_{prior}^2}{T \sum_a r_a / \sigma_a^2 + 1/\sigma_{prior}^2}$
 this takes into account the prior
 narrow prior makes it matter more
 doesn’t incorporate correlations in the population
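A sketch of population-vector decoding with the Gaussian tuning curves described above (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
s_true = 23.0                                  # true stimulus (e.g. an orientation, degrees)
s_pref = np.linspace(-90, 90, 19)              # preferred values s_a of the population
sigma = 15.0                                   # shared tuning width

f = 30 * np.exp(-(s_true - s_pref) ** 2 / (2 * sigma ** 2))   # tuning curves f_a(s_true)
r = rng.poisson(f)                             # observed spike counts r_a

s_hat = np.sum(r * s_pref) / np.sum(r)         # population vector (all sigma_a equal)
print(s_true, s_hat)
```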
stimulus reconstruction
 decoding s → $s^*$
 want an estimator $s_{Bayes}=s_B$ given some response r
 error function $L(s,s_{B})=(s-s_{B})^2$
 minimize $\int ds \, L(s,s_{B}) \, p(s|r)$ by taking derivative with respect to $s_B$
 $s_B = \int ds \, p(s|r) \, s$ - the conditional mean (spike-triggered average)
 add in spike-triggered average at each spike
 if spike-triggered average looks exponential, can never have smooth downwards stimulus
 could use 2 neurons (like in H1) and replay the second with negative sign
 LGN neurons can reconstruct a video, but with noise
 recreated 1 sec long movies  (Jack Gallant  Nishimoto et al. 2011, Current Biology)
 voxel-based encoding model samples a ton of prior clips and predicts the signal
 get p(r|s)
 pick best p(r|s) by comparing predicted signal to actual signal
 input is filtered to extract certain features
 filtered again to account for slow timescale of BOLD signal
 decoding
 maximize p(s|r) by maximizing p(r|s) p(s), and assume p(s) uniform
 30 signals that have highest match to predicted signal are averaged
 yields pretty good pictures
information theory
information and entropy
 surprise for seeing a spike h(p) = $-\log_2 (p)$
 entropy = average information
 code might not align spikes with what we are encoding
 how much of the variability in r is encoding s
 define q as the error probability
 $P(r_+|s=+)=1-q$
 $P(r_-|s=+)=q$
 similar for when s = -
 total entropy: $H(R) = - P(r_+) \log P(r_+) - P(r_-)\log P(r_-)$
 noise entropy: $H(R|S=+) = -q \log q - (1-q) \log (1-q)$
 mutual info I(S;R) = $H(R) - H(R|S)$ = total entropy - average noise entropy
 = $D_{KL} (P(R,S) \| P(R)P(S))$
 grandma’s famous mutual info recipe (sketch after this list)
 for each s
 P(R|s) - take one stimulus and repeat many times (or run for a long time)
 H(R|s) - noise entropy
 $H(R|S)=\sum_s P(s) H(R|s)$
 $H(R)$ calculated using $P(R) = \sum_s P(s) P(R|s)$
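The recipe above for the binary example, in a few lines (q is the assumed error probability):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

q = 0.1
P_s = np.array([0.5, 0.5])                 # P(s) for s in {+, -}
P_r_given_s = np.array([[1 - q, q],        # rows: s = +, -; cols: r = r+, r-
                        [q, 1 - q]])

P_r = P_s @ P_r_given_s                    # P(R) = sum_s P(s) P(R|s)
H_total = entropy(P_r)
H_noise = np.sum(P_s * np.array([entropy(row) for row in P_r_given_s]))
print("I(S;R) =", H_total - H_noise, "bits")
```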
info in spike trains
 information in spike patterns
 divide pattern into time bins of 0 (no spike) and 1 (spike)
 binary words w with letter size $\Delta t$, length T (Reinagel & Reid 2000)
 can create histogram of each word
 total entropy: average over time with random stimuli and calculate the entropy of the word distribution
 noise entropy $H_{noise}$: look at the distribution of words for just one (repeated) stimulus - this distribution should be narrower
 varied parameters of word: length of bin (dt) and length of word (T)
 there’s some limit to dt at which information stops increasing
 this represents temporal resolution at which jitter doesn’t stop response from identifying info about the stimulus
 corrections for finite sample size (Panzeri, Nemenman,…)
 information in single spikes  how much info does single spike tell us about stimulus
 don’t have to know encoding, mutual info doesn’t care
 calculate entropy for random stimulus  $p=\bar{r} \Delta t$ where $\bar{r}$ is the mean firing rate
 calculate entropy for specific stimulus
 let $P(r=1|s) = r(t) \Delta t$
 let $P(r=0|s) = 1 - r(t) \Delta t$
 get r(t) by having the stimulus on for a long time
 ergodicity - a time average is equivalent to averaging over the s ensemble
 info per spike $I(r,s) = \frac{1}{T} \int_0^T dt \, \frac{r(t)}{\bar{r}} \log_2 \frac{r(t)}{\bar{r}}$ (sketch after this list)
 timing precision reduces r(t)
 low mean spike rate → high info per spike
 ex. rat runs through place field and only fires when it’s in place field
 spikes can be sharper, more / less frequent
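A numeric version of the info-per-spike formula above, with a made-up time-varying rate r(t):

```python
import numpy as np

dt, T = 1e-3, 100.0
t = np.arange(0, T, dt)
r = 5.0 + 4.0 * np.sin(2 * np.pi * 0.5 * t)    # toy rate r(t) in Hz (always > 0)
r_bar = r.mean()

# I per spike = (1/T) * integral of (r/r_bar) log2 (r/r_bar) dt
I_per_spike = (1 / T) * np.sum((r / r_bar) * np.log2(r / r_bar)) * dt
print(I_per_spike, "bits/spike")
```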
coding principles
 natural stimuli
 huge dynamic range  variations over many orders of magnitude (ex. brightness)
 power law scaling  structure at many scales (ex. far away things)
 efficient coding  in order to have maximum entropy output, a good encoder should match its outputs to the distribution of its inputs
 want to use each of our “symbols” (ex. different firing rates) equally often
 should assign equal areas of input stimulus PDF to each symbol
 adaptation to stimulus statistics
 feature adaptation (Atick and Redlich)
 spatial filtering properties in retina / LGN change with varying light levels
 at low light levels surround becomes weaker
 coding schemes
 redundancy reduction
 population code $P(R_1,R_2)$
 entropy $H(R_1,R_2) \leq H(R_1) + H(R_2)$  being independent would maximize entropy
 correlations can be good
 error correction and robust coding
 correlations can help discrimination
 retina neurons are redundant (Berry, Chichilnisky)
 more recently, sparse coding
 penalize weights of basis functions
 instead, we get localized features
 we ignored the behavioral feedback loop
computing with networks
modeling connections between neurons
 model effects of synapse by using synaptic conductance $g_s$ with reversal potential $E_s$
 $g_s = g_{s,max} \cdot P_{rel} \cdot P_s$
 $P_{rel}$  probability of release given an input spike
 $P_s$  probability of postsynaptic channel opening = fraction of channels opened
 basic synapse model
 assume $P_{rel}=1$
 model $P_s$ with kinetic model
 open based on $\alpha_s$
 close based on $\beta_s$
 yields $\frac{dP_s}{dt} = \alpha_s (1-P_s) - \beta_s P_s$ (sketch after this list)
 3 synapse types
 AMPA - well fit by an exponential
 GABA - fit by an “alpha” function - has some delay
 NMDA  fit by “alpha” function  has some delay
 linear filter model of a synapse
 pick filter (ex. K(t) ~ exponential)
 $g_s = g_{s,max} \sum_i K(t-t_i)$
 network of integrateandfire neurons
 if 2 neurons inhibit each other, can get synchrony (they fire at the same time)
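A small sketch of the kinetic synapse model above: transmitter is assumed present only during a brief window, so the opening rate alpha is nonzero only then (rate constants and timings are arbitrary):

```python
import numpy as np

def open_fraction(T=0.2, dt=1e-4, alpha=1000.0, beta=200.0, stim=(0.02, 0.04)):
    """Integrate dP_s/dt = alpha*(1 - P_s) - beta*P_s, with alpha on only during stim."""
    n = int(T / dt)
    P = np.zeros(n)
    for i in range(1, n):
        a = alpha if stim[0] <= i * dt < stim[1] else 0.0
        P[i] = P[i-1] + dt * (a * (1 - P[i-1]) - beta * P[i-1])
    return P

P_s = open_fraction()
g_s = 1e-9 * P_s          # g_s = g_s,max * P_rel * P_s with P_rel = 1 (assumed)
print(P_s.max())
```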
intro to network models
 comparing spiking models to firingrate models
 advantages
 spike timing
 spike correlations / synchrony between neurons
 disadvantages
 computationally expensive
 uses linear filter model of a synapse
 developing a firingrate model
 replace spike train $\rho_1(t) \to u_1(t)$
 can’t make this replacement when there are correlations / synchrony?
 input current $I_s$: $\tau_s \frac{dI_s}{dt}= -I_s + \mathbf{w} \cdot \mathbf{u}$
 works only if we let K be exponential
 output firing rate: $\tau_r \frac{d\nu}{dt} = -\nu + F(I_s(t))$
 if synapses are fast ($\tau_s \ll \tau_r$)
 $\tau_r \frac{d\nu}{dt} = -\nu + F(\mathbf{w} \cdot \mathbf{u})$
 if synapses are slow ($\tau_r \ll \tau_s$)
 $\nu = F(I_s(t))$
 if static inputs (input doesn’t change)  this is like artificial neural network, where F is sigmoid
 $\nu_{\infty} = F(\mathbf{w} \cdot \mathbf{u})$
 could make these all vectors to extend to multiple output neurons
 recurrent networks
 $\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + F(W\mathbf{u} + M \mathbf{v})$
 $-\mathbf{v}$ is decay
 $W\mathbf{u}$ is input
 $M \mathbf{v}$ is feedback
 with constant input, $v_{\infty} = W \mathbf{u}$
 ex. edge detectors
 V1 neurons are basically computing derivatives
recurrent networks
 linear recurrent network: $\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + W\mathbf{u} + M \mathbf{v}$
 let $\mathbf{h} = W\mathbf{u}$
 want to investigate different M
 can solve for $\mathbf{v}$ using eigenvectors
 suppose M (NxN) is symmetric (connections are equal in both directions)
 $\to$ M has N orthogonal eigenvectors / eigenvalues
 let $e_i$ be the orthonormal eigenvectors
 output vector $\mathbf{v}(t) = \sum c_i (t) \mathbf{e_i}$
 allows us to get a closedform solution for $c_i(t)$
 eigenvalues determine network stability
 if any $\lambda_i > 1, \mathbf{v}(t)$ explodes $\implies$ network is unstable
 otherwise stable and converges to steadystate value
 $\mathbf{v}_\infty = \sum_i \frac{\mathbf{h}\cdot \mathbf{e_i}}{1-\lambda_i} \mathbf{e_i}$ (sketch at the end of this section)
 amplification of input projection by a factor of $\frac{1}{1-\lambda_i}$
 ex. each output neuron codes for an angle between -180° and 180°
 define M as cosine function of relative angle
 excitation nearby, inhibition further away
 memory in linear recurrent networks
 suppose $\lambda_1=1$ and all other $\lambda_i < 1$
 then $\tau \frac{dc_1}{dt} = h \cdot e_1$  keeps memory of input
 ex. memory of eye position in medial vestibular nucleus (Seung et al. 2000)
 integrator neuron maintains persistent activity
 nonlinear recurrent networks: $\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + F(\mathbf{h}+ M \mathbf{v})$
 ex. ReLu F(x) = max(0,x)
 ensures that firing rates never go below zero
 can have eigenvalues > 1 but stable due to rectification
 can perform selective “attention”
 network performs “winnertakesall” input selection
 gain modulation  adding constant amount to input h multiplies the output
 also maintains memory
 nonsymmetric recurrent networks
 ex. excitatory and inhibitory neurons
 linear stability analysis  find fixed points and take partial derivatives
 use eigenvalues to determine dynamics of the nonlinear network near a fixed point
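A sketch of the eigenvector solution for the linear recurrent network above: build a symmetric M with eigenvalues below 1, then check that the eigenvector formula for the steady state matches solving the fixed point directly (size and scaling are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
A = rng.normal(size=(N, N))
S = (A + A.T) / 2
M = 0.9 * S / np.abs(np.linalg.eigvalsh(S)).max()   # symmetric, max |eigenvalue| = 0.9 < 1

h = rng.normal(size=N)                   # h = W u, the feedforward input
lam, E = np.linalg.eigh(M)               # eigenvalues and orthonormal eigenvectors e_i (columns)

v_inf = E @ ((E.T @ h) / (1 - lam))      # v_inf = sum_i (h . e_i) / (1 - lambda_i) e_i
print(np.allclose(v_inf, np.linalg.solve(np.eye(N) - M, h)))   # same as solving 0 = -v + h + M v
```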
hopfield nets
 hopfield nets can store / retrieve memories
 marr-poggio stereo algorithm
 binary Hopfield networks were introduced as associative memories that can store and retrieve patterns (Hopfield, 1982)
 network with dimension $d$ can store $d$ uncorrelated patterns, but fewer correlated patterns
 in contrast to the storage capacity, the number of energy minima (spurious states, stable states) of Hopfield networks is exponential in $d$ (Tanaka & Edwards, 1980; Bruck & Roychowdhury, 1990; Wainrib & Touboul, 2013)
 energy function only has pairwise connections
 fully connected (no input/output) - activations are what matter
 can memorize patterns - starting from noisy patterns, the network converges back to the stored patterns (sketch after this list)
 hopfield threeway connections
 $E = - \sum_{i, j, k} T_{i, j, k} V_i V_j V_k$ (self connections set to 0)
 update to $V_i$ is now bilinear
 modern hopfield network = dense associative memory (DAM) model
 use an energy function with interaction functions of the form $F(x) = x^n$ and achieve storage capacity $\propto d^{n-1}$ (Krotov & Hopfield, 2016; 2018)
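A minimal binary Hopfield sketch (Hebbian storage, asynchronous updates); pattern count and dimension are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_patterns = 100, 5
patterns = rng.choice([-1, 1], size=(n_patterns, d))

W = patterns.T @ patterns                 # Hebbian weights; no self-connections
np.fill_diagonal(W, 0)

def recall(x, steps=20):
    """Asynchronous updates descend E = -1/2 sum_ij W_ij s_i s_j."""
    s = x.copy()
    for _ in range(steps):
        for i in rng.permutation(d):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

noisy = patterns[0] * np.where(rng.random(d) < 0.2, -1, 1)   # flip ~20% of the bits
print(np.mean(recall(noisy) == patterns[0]))                  # ~1.0: stored pattern retrieved
```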
learning
supervised learning
 NETtalk was a major breakthrough (words → audio) (Sejnowski & Rosenberg, 1987)
 people looked for world-centric receptive fields (so neurons responded to things not relative to the retina but relative to the body) but didn’t find them
 however, they did find gain fields: (Zipser & Anderson, 1987)
 gain changes based on what retina is pointing at
 trained nn to go from pixels to headcentered coordinate frame
 yielded gain fields
 pouget et al. found that this can be accounted for with 2 population vectors (one for retinal position, one for eye position) that get added
 support vector networks (vapnik et al.)  svms early inspired from NNs
 dendritic nonlinearities (hausser & mel, 2003)
 example to think about how neurons could do this: $u = w_1 x_1 + w_2x_2 + w_{12}x_1x_2$
 $y=\sigma(u)$
 sometimes called a sigma-pi unit since it’s a sum of products
 exponential number of params…could be fixed w/ kernel trick?
 could also incorporate geometry constraint
unsupervised learning
 born w/ extremely strong priors on weights in different areas
 barlow 1961, attneave 1954: efficient coding hypothesis = redundancy reduction hypothesis
 representation: compression / usefulness
 easier to store prior probabilities (because inputs are independent)
 redlich 93: redundancy reduction for unsupervised learning (text ex. learns words from text w/out spaces)
hebbian learning and pca
 pca can also be thought of as a tool for decorrelation (pc coefs tend to be less correlated)
 hebbian learning = fire together, wire together: $\Delta w_{ab} \propto <a, b>$ note: $<a, b>$ is correlation of a and b (average over time)
 linear hebbian learning (perceptron with linear output)
 $\dot{w}_i \propto <y, x_i> \propto \sum_j w_j <x_j, x_i>$ since weights change relatively slowly
 synapse couldn’t do this, would grow too large
 oja’s rule (hebbian learning w/ weight decay so ws don’t get too big)
 points in the correct direction (towards the principal eigenvector)
 sanger’s rule: for multiple neurons, fit residuals of other neurons
 competitive learning rule: winner take all
 population nonlinearity is a max
 gets stuck in local minima (basically kmeans)
 pca only really good when data is gaussian
 interesting problems are nongaussian, nonlinear, nonconvex
 pca: yields checkerboards that get increasingly complex (because images are smooth, can describe with smaller checkerboards)
 this is what jpeg does
 very similar to discrete cosine transform (DCT)
 very hard for neurons to get receptive fields that look like this
 retina: does whitening (yields centersurround receptive fields)
 easier to build
 gets more even outputs
 only has ~1.5 million fibers

 most active neuron is the one whose w is closest to x
 competitive learning
 updating weights given a new input
 pick a cluster (corresponds to most active neuron)
 set weight vector for that cluster to running average of all inputs in that cluster
 $\Delta w = \epsilon \cdot (\mathbf{x} - \mathbf{w})$
synaptic plasticity, hebb’s rule, and statistical learning
 if 2 spikes keep firing at same time, get LTP  longterm potentiation
 if the input fires but the output doesn’t, could get LTD - long-term depression
 Hebb rule $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}v$
 $\mathbf{x}$  input
 $v$  output
 translates to $\mathbf{w}_{i+1}=\mathbf{w}_i + \epsilon \cdot \mathbf{x}v$
 average effect of the rule is to change based on the input correlation matrix $Q = \langle \mathbf{x}\mathbf{x}^T \rangle$
 covariance rule: $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}(v-E[v])$
 includes LTD as well as LTP
 Oja’s rule: $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}v - \alpha v^2 \mathbf{w}$ where $\alpha>0$ (sketch after this list)
 stability
 Hebb rule  derivative of w is always positive $\implies$ w grows without bound
 covariance rule  derivative of w is still always positive $\implies$ w grows without bound
 could add constraint that $\|\mathbf{w}\|=1$ and normalize w after every step
 Oja’s rule - $\|\mathbf{w}\| \to 1/\sqrt{\alpha}$, so stable
 solving Hebb rule $\tau_w \frac{d\mathbf{w}}{dt} = Q w$ where Q represents correlation matrix
 write w(t) in terms of eigenvectors of Q
 lets us solve for $\mathbf{w}(t)=\sum_i c_i(0)\exp(\lambda_i t / \tau_w) \mathbf{e}_i$
 when t is large, largest eigenvalue dominates
 hebbian learning implements PCA
 hebbian learning learns w aligned with principal eigenvector of input correlation matrix
 this is same as PCA
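A small check that Oja's rule converges to the principal eigenvector of the input correlation matrix (2-D correlated inputs with a made-up covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[1.0, 0.8],
              [0.8, 1.0]])                 # input covariance; principal direction ~ [1, 1]
X = rng.multivariate_normal([0, 0], C, size=5000)

w, eta = rng.normal(size=2), 0.01
for x in X:
    v = w @ x
    w += eta * (v * x - v ** 2 * w)        # Oja: Hebbian term + decay keeps |w| bounded

print(w / np.linalg.norm(w), np.linalg.eigh(C)[1][:, -1])   # aligned up to sign
```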
tensor product representation (TPR)
 tensor product representation = TPR
 Tensor product variable binding and the representation of symbolic structures in connectionist systems (paul smolensky, 1990)  activation patterns are “symbols” and internal structure allows them to be processed like symbols
 filler  one vector that embeds the content of the constituent
 role  second vector that embeds the structural role it fills
 TPR is built by summing the outer products between role and filler vectors (toy sketch at the end of this section):
 $\mathbf{T} = \sum_i \mathbf{f}_i \otimes \mathbf{r}_i$ where $\mathbf{f}_i$ is the filler and $\mathbf{r}_i$ the role of constituent $i$
 can optionally flatten the final TPR as is done in mccoy…smolensky, 2019 if we want to compare it to a vector or an embedding
 TPR of a structure is the sum of the TPR of its constituents
 tensor product operation allows constituents to be uniquely identified, even after the sum (if roles are linearly independent)
 TPR intro blog post
 TPR slides
 RNNs Implicitly Implement Tensor Product Representations (mccoy…smolensky, 2019)
 introduce TP Decomposition Networks (TPDNs), which use TPRs to approximate existing vector representations
 assumes a particular hypothesis for the relevant set of roles (e.g., sequence indexes or structural positions in a parse tree)
 TPDNs can successfully approximate linear and treebased RNN autoencoder representations
 evaluate TPDN based on how well the decoder applied to the TPDN representation produces the same output as the original RNN
 Discovering the Compositional Structure of Vector Representations with Role Learning Networks (soulos, mccoy, linzen, & smolensky, 2019)  extend DISCOVER to learned roles with an LSTM
 role vector is regularized to be onehot
 Concepts and Compositionality: In Search of the Brain’s Language of Thought (frankland & greene, 2020)
 Fodor’s classic language of thought hypothesis: our minds employ an amodal, languagelike system for combining and recombining simple concepts to form more complex thoughts
 combinatorial processes engage a common set of brain regions, typically housed throughout the brain’s default mode network (DMN)
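A toy tensor-product-representation sketch for the construction at the start of this section: bind random filler and role vectors with outer products, sum them, then unbind with the dual basis of the roles (vector sizes and the example sequence are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["the", "cat", "sat"]
roles = {i: rng.normal(size=8) for i in range(len(words))}      # role = position index
fillers = {w: rng.normal(size=8) for w in words}                # filler = word embedding

# TPR of the sequence: sum over constituents of filler (outer product) role
T = sum(np.outer(fillers[w], roles[i]) for i, w in enumerate(words))

R = np.stack([roles[i] for i in range(len(words))], axis=1)     # roles as columns
dual = np.linalg.pinv(R).T                                      # dual role vectors for unbinding
recovered = T @ dual[:, 1]                                      # filler bound to role 1
print(max(fillers, key=lambda w: fillers[w] @ recovered))       # "cat"
```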
sparse coding and predictive coding
 eigenface  Turk and Pentland 1991
 eigenvectors of the input covariance matrix are good features
 can represent images using sum of eigenvectors (orthonormal basis)
 suppose you use only first M principal eigenvectors
 then there is some noise
 can use this for compression
 not good for local components of an image (e.g. parts of face, local edges)
 if you assume Gaussian noise, maximizing likelihood = minimizing squared error
 generative model
 images X
 causes
 likelihood $P(X=x|C=c)$
 Gaussian
 proportional to $\exp(-\|x-Gc\|^2)$
 want posterior $P(C|X)$
 prior $p(C)$
 assume priors on causes are independent
 want sparse distribution
 has heavy tails (super-Gaussian distribution)
 then $P(C) = k\prod_i \exp(g(C_i))$
 can implement sparse coding in a recurrent neural network
 Olshausen & Field, 1996 - learns V1-like receptive fields (sketch at the end of this section)
 sparse coding is a special case of predictive coding
 there is usually a feedback connection for every feedforward connection (Rao & Ballard, 1999)
 recurrent sparse reconstruction (shi…joshi, darrell, wang, 2022)  sparse reconstruction (of a single image) learns a layer that does better than selfattention
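A rough sketch of the sparse-coding loop above (fast inference of coefficients a per image, slow learning of the dictionary phi); random data stands in for whitened image patches, and the ISTA-style inference used here is one common choice, not necessarily the original algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_basis = 64, 128                       # overcomplete: 128 basis functions for 8x8 patches
X = rng.normal(size=(500, n_pix))              # stand-in for whitened image patches
Phi = rng.normal(size=(n_pix, n_basis))
Phi /= np.linalg.norm(Phi, axis=0)

def infer(x, Phi, lam=0.1, n_steps=100, lr=0.01):
    """ISTA-style minimization of 1/2 |x - Phi a|^2 + lam |a|_1 over the coefficients a."""
    a = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        a = a + lr * Phi.T @ (x - Phi @ a)                     # gradient step on reconstruction
        a = np.sign(a) * np.maximum(np.abs(a) - lr * lam, 0)   # soft threshold -> sparsity
    return a

for x in X:                                    # slow outer loop: learn the dictionary
    a = infer(x, Phi)
    Phi += 0.01 * np.outer(x - Phi @ a, a)     # gradient step on the reconstruction error
    Phi /= np.linalg.norm(Phi, axis=0)         # keep basis functions unit norm
```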
sparse, distributed coding

\[\underset{\mathbf{D}}{\min} \sum_t \underset{\mathbf{h^{(t)}}}{\min} \|\mathbf{x^{(t)}} - \mathbf{Dh^{(t)}}\|_2^2 + \lambda \|\mathbf{h^{(t)}}\|_1\]
 $D$ is like autoencoder output weight matrix
 h is more complicated  requires solving inner minimization problem
 outer loop is not quite lasso  weights are not what is penalized
 barlow 1972: want to represent stimulus with minimum active neurons
 neurons farther in cortex are more silent
 v1 is highly overcomplete (dimensionality expansion)
 codes: dense → sparse, distributed ($n \choose k$) → local (grandmother cells)
 energy argument  bruno doesn’t think it’s a big deal (could just not have a brain)
 PCA: autoencoder when you enforce weights to be orthonormal
 retina must output encoded inputs as spikes, lower dimension > uses whitening
 cortex
 sparse coding is a different kind of autoencoder bottleneck (imposes sparsity)
 using bottlenecks in autoencoders forces you to find structure in data
 v1 simple-cell receptive fields are localized, oriented, and bandpass
 higherorder image statistics
 phase alignment
 orientation (requires at least 3-point statistics)
 motion
 how to learn sparse repr?
 foldiak 1990 forming sparse reprs by local antihebbian learning
 driven by inputs and gets lateral inhibition and sum threshold
 neurons drift towards some firing rate naturally (adjust threshold naturally)
 use higherorder statistics
 projection pursuit (field 1994)  maximize nongaussianity of projections
 CLT says random projections should look gaussian
 gabor-filter response histograms over natural images look non-Gaussian (sparse) - peaked at 0
 doesn’t work for graded signals
 sparse coding for graded signals: olshausen & field, 1996
 $\underset{Image}{I(x, y)} = \sum_i a_i \phi_i (x, y) + \epsilon (x,y)$

loss function $\frac{1}{2} \|I - \phi a\|^2 + \lambda \sum_i C(a_i)$
 can think about the difference between $L_1$ and $L_2$ as having preferred directions (for the same vector length) - $L_1$ prefers directions with some zeros
 in terms of optimization, smooth near zero
 there is a network implementation
 $a_i$ are calculated by solving the optimization for each image; $\phi$ is learned more slowly
 can you get a closed-form soln for $a_i$?
 wavelets invented in 1980s/1990s for sparsity + compression
 these tuning curves match those of real v1 neurons
 applications
 for time, have spatiotemporal basis where local wavelet moves
 sparse coding of natural sounds
 audition like a movie with two pixels (each ear sounds independent)
 converges to gamma tone functions, which is what auditory fibers look like
 sparse coding to neural recordings  finds spikes in neurons
 learns that different layers activate together, different frequencies come out
 found place cell bases for LFP in hippocampus
 nonnegative matrix factorization  like sparse coding but enforces nonnegative
 can explicitly enforce nonnegativity
 LCA algorithm lets us implement sparse coding in biologically plausible local manner
 explaining away  neural responses at the population should be decodable (shouldn’t be ambiguous)
 good project: understanding properties of sparse coding bases

SNR = $\mathrm{Var}(I) / \mathrm{Var}(I - \phi a)$ - can run on data after whitening
 graph is of power vs frequency (image spectra fall as $1/f$), so need to whiten by weighting with f
 don’t whiten highest frequencies (because really just noise)
 need to do this softly  roughly what the retina does
 as a result higher spatial frequency activations have less variance
 whitening effect on sparse coding
 if you don’t whiten, have some directions that have much more variance
 projects
 applying to different types of data (ex. auditory)
 adding more bases as time goes on
 combining convolution w/ sparse coding?
 people didn’t see sparsity for a while because they were using very specific stimuli and specific neurons
 now people with less biased sampling are finding more sparsity
 in cortex anesthesia tends to lower firing rates, but the opposite holds in hippocampus
self-organizing maps = kohonen maps
 homunculus  3d map corresponds to map in cortex (sensory + motor)
 related to self-organizing maps = kohonen maps
 in self-organizing maps, update the other neurons in the neighborhood of the winner
 update winner closer
 update neighbors to also be closer
 ex. V1 has orientation preference maps that do this
 visual cortex
 visual cortex mostly devoted to center
 different neurons in same regions sensitive to different orientations (changing smoothly)
 orientation constant along column
 orientation maps not found in mice (but in cats, monkeys)
 direction selective cells as well
 maps are plastic  cortex devoted to particular tasks expands (not passive, needs to be active)
 kids therapy with tonetracking video games at higher and higher frequencies
probabilistic models + inference
 Wiener filter: has Gaussian prior + likelihood

gaussians are everywhere because of CLT, max entropy (subject to power constraint)
 for a gaussian function, $\frac{d}{dx} f(x) = -x f(x)$
boltzmann machines
 hinton & sejnowski 1983
 starts with a hopfield net (states $s_i$ weights $\lambda_{ij}$) where states are $\pm 1$
 define energy function $E(\mathbf{s}) = - \sum_{ij} \lambda_{ij} s_i s_j$
 assume Boltzmann distr $P(\mathbf{s}) = \frac{1}{Z} \exp (- \beta E(\mathbf{s}))$
 learning rule is basically (expectation over data) - (expectation over model)
 could use wakesleep algorithm
 during day, calculate expectation over data via Hebbian learning (in Hopfield net this would store minima)
 during night, would run antihebbian by doing random walk over network (in Hopfield net this would remove spurious local minima)
 learn via gibbs sampling (prob for one node conditioned on others is sigmoid)
 can add hidden units to allow for learning higherorder interactions (not just pairwise)
 restricted boltzmann machine: no connections between “visible” units and no connections between “hidden units”
 computationally easier (sampling is independent) but less rich
 stacked rbm: hinton & salakhutdinov (hinton argues this is first paper to launch deep learning)
 don’t train layers jointly
 learn weights with rbms as encoder
 then decoder is just transpose of weights
 finally, run finetuning on autoencoder
 able to separate units in hidden layer
 cool  didn’t actually need decoder
 in rbm
 when measuring true distr, don’t see hidden vals
 instead observe visible units and conditionally sample over hidden units

$P(h|v) = \prod_i P(h_i|v)$ - easy to sample from

when measuring the sampled distr., just sample $P(h|v)$ then sample $P(v|h)$ (sketch at the end of this section)
 ising model  only visible units
 basically just replicates pairwise statistics (kind of like pca)
 pairwise statistics basically say “when I’m on, are my neighbors on?”
 need 3point statistics to learn a line
 generating textures
 learn the distribution of pixels in 3x3 patches
 then maximize this distribution  can yield textures
 reducing the dimensionality of data with neural networks
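A minimal RBM Gibbs-sampling / contrastive-divergence-style sketch of the sampling scheme above (sizes and learning rate are arbitrary, and bias terms are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 20, 10
W = 0.01 * rng.normal(size=(n_vis, n_hid))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sample_h(v):                 # P(h|v) factorizes: no hidden-hidden connections
    p = sigmoid(v @ W)
    return (rng.random(n_hid) < p).astype(float), p

def sample_v(h):                 # P(v|h) factorizes: no visible-visible connections
    p = sigmoid(W @ h)
    return (rng.random(n_vis) < p).astype(float), p

v_data = (rng.random(n_vis) < 0.5).astype(float)   # toy binary "data" vector
h_data, ph_data = sample_h(v_data)
v_model, _ = sample_v(h_data)                      # reconstruct: "model" / sleep phase
_, ph_model = sample_h(v_model)

W += 0.1 * (np.outer(v_data, ph_data) - np.outer(v_model, ph_model))   # data - model statistics
```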
data-driven neuroscience

https://medium.com/the-spike/a-neural-data-science-how-and-why-d7e3969086f2
data types
| | EEG | ECoG | local field potential (LFP) | single-unit (microelectrode array) | calcium imaging | fMRI |
|---|---|---|---|---|---|---|
| scale | high | high | low | tiny | low | high |
| spatial res | very low | low | mid-low | x | low | mid-low |
| temporal res | mid-high | high | high | super high | high | very low |
| invasiveness | non | yes (under skull) | very | very | non | non |

pro bigdata
Artificial neural networks can compute in several different ways. There is some evidence in the visual system that neurons in higher visual areas can, to some extent, be predicted linearly from higher layers of deep networks (Yamins et al. 2014)
 when comparing energyefficiency, must normalize network performance by energy / number of computations / parameters
anti bigdata
 could neuroscientist understand microprocessor (jonas & kording, 2017)
 System Identification of Neural Systems: If We Got It Right, Would We Know? (han, poggio, & cheung, 2023)  could functional similarity be a reliable predictor of architectural similarity?
 Can a biologist fix a radio?—Or, what I learned while studying apoptosis (lazebnik, 2002)
 no canonical microcircuit
 cellular
 extracellular microelectrodes
 intracellular microelectrode
 neuropixels
 optical
 calcium imaging / fluorescence imaging
 wholebrain light sheet imaging
 voltagesensitive dyes / voltage imaging
 adaptive optics
 fNIRS - like fMRI but cheaper, allows more mobility, slightly worse spatial res
 OCT - noninvasive - can look at retina (maybe find biomarkers of alzheimer’s)
 fiber photometry  optical fiber implanted delivers excitation light
 highlevel
 EEG/ECoG
 MEG
 fMRI/PET
 MRI with millisecond temporal precision
 molecular fmri (bartelle)
 MRS
 eventrelated optical signal = nearinfrared spectroscopy
 implantable
 neural dust
interventions
 optogenetic stimulation
 tms
 geneticallytargeted tms: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4846560/
 ect  Electroconvulsive Therapy (sometimes also called electroshock therapy)
 Identifying Recipients of Electroconvulsive Therapy: Data From Privately Insured Americans (wilkinson…rosenheck, 2018) - ~100k ppl per year
 can differ in its application in three ways
 electrode placement
 used to be bilateral, now unilateral is more popular
 electrical waveform of the stimulus
 used to be sinusoid, now brief pulse is more popular (has gotten briefer over time)
 treatment frequency
 public perception is largely negative (owing in large part to its portrayal in One Flew Over the Cuckoo’s Nest)
 research
 How Does Electroconvulsive Therapy Work? Theories on its Mechanism (bolwig, 2011)
 generalized seizures
 normalization of neuroendocrine dysfunction in melancholic depression
 increased hippocampal neurogenesis and synaptogenesis
 Electroconvulsive therapy: How modern techniques improve patient outcomes (tirmizi, 2012)
 The neurobiological effects of electroconvulsive therapy studied through magnetic resonance – what have we learnt and where do we go? (ousdal et al. 2022)
 Clinical EEG slowing induced by electroconvulsive therapy is better described by increased frontal aperiodic activity (smith…soltani, 2023)
 local microstimulation with invasive electrodes
datasets

EEG

Brennan & Hale, 2019: 33 subjects recorded with EEG, listening to 12 min of a book chapter, no repeated session

Broderick et al. 2018: 9–33 subjects recorded with EEG, conducting different speech tasks, no repeated sessions


NLP fMRI
 A natural language fMRI dataset for voxelwise encoding models (lebel, … huth, 2022)
 8 participants listening to ~6 hours each of the moth radio hour
 3 of the participants have ~20 hours (~95 stories, 33k timepoints)
 Narratives Dataset (Nastase et al. 2019)  more subjects, less data per subject
 345 subjects, 891 functional scans, and 27 diverse stories of varying duration totaling ~4.6 hours of unique stimuli (~43,000 words) and total collection time is ~6.4 days
 Schoffelen et al. 2019: 100 subjects recorded with fMRI and MEG, listening to decontextualised sentences and word lists, no repeated session
 Huth et al. 2016 released data from one subject
 Visual and linguistic semantic representations are aligned at the border of human visual cortex (popham, huth et al. 2021)  compared semantic maps obtained from two functional magnetic resonance imaging experiments in the same participants: one that used silent movies as stimuli and another that used narrative stories (data link)

NLP MEG datasets

MEG-MASC (gwilliams…king, 2023) - 27 English-speaking subjects recorded with MEG, each ~2 hours of story listening, punctuated by random word lists and comprehension questions in the MEG scanner. Usually each subject listened to four distinct fictional stories twice

WU-Minn human connectome project (van Essen et al. 2013) - 72 subjects recorded with fMRI and MEG as part of the Human Connectome Project, listening to 10 minutes of short stories, no repeated session

Armeni et al. 2022: 3 subjects recorded with MEG, listening to 10 h of Sherlock Holmes, no repeated session

 nonhuman primate optogenetics datasets
 vision dsets
 MRNet: knee MRI diagnosis
 datalad lots of stuff

stringer 10k calcium imaging recording: https://figshare.com/articles/Recordings_of_ten_thousand_neurons_in_visual_cortex_during_spontaneous_behaviors/6163622

stringer 2: 10k neurons with 2800 images

stringer et al. data

10000 neurons from visual cortex

 neuropixels probes
 10k neurons visual coding from allen institute
 this probe has also been used in macaques
 allen institute calcium imaging
 An experiment is the unique combination of one mouse, one imaging depth (e.g. 175 um from surface of cortex), and one visual area (e.g. “Anterolateral visual area” or “VISal”)
 predicting running, facial cues
 dimensionality reduction
 enforcing bottleneck in the deep model
 how else to do dim reduction?
 overview: http://www.scholarpedia.org/article/Encyclopedia_of_computational_neuroscience
 keeping up to date: https://sanjayankur31.github.io/planetneuroscience/
 lots of good data: http://home.earthlink.net/~perlewitz/index.html

connectome
 fly brain: http://temca2data.org/
 models
 senseLab: https://senselab.med.yale.edu/
 modelDB  has NEURON code
 model databases: http://www.cnsorg.org/modeldatabase
 comp neuro databases: http://home.earthlink.net/~perlewitz/database.html
 senseLab: https://senselab.med.yale.edu/
 raw misc data
 crcns data: http://crcns.org/
 visual cortex data (gallant)
 hippocampus spike trains
 allen brain atlas: http://www.brain-map.org/
 includes calcium-imaging dataset: http://help.brain-map.org/display/observatory/Data+-+Visual+Coding
 wikipedia page: https://en.wikipedia.org/wiki/List_of_neuroscience_databases
 human fMRI datasets: https://docs.google.com/document/d/1bRqfcJOV7U4faa3h8yPBjYQoLXYLLgeY6_af_N2CTM/edit
 Kay et al 2008 has data on responses to images

calcium imaging for spike sorting: http://spikefinder.codeneuro.org/
 spikes: http://www2.le.ac.uk/departments/engineering/research/bioengineering/neuroengineeringlab/software
 More datasets available at openneuro and visual cortex data on crcns
misc ideas
 could a neuroscientist understand a deep neural network?  use neural tracing to build up wiring diagram / function
 predictiondriven dimensionality reduction
 deep heuristic for modelbuilding
 joint prediction of different input/output relationships
 joint prediction of neurons from other areas
eeg
 directly model time series
 BENDR: using transformers and a contrastive selfsupervised learning task to learn from massive amounts of EEG data (kostas…rudzicz, 2021)
 NeuroGPT: Developing A Foundation Model for EEG (cui…leahy, 2023)
 model frequency bands
 EEG foundation model: Learning TopologyAgnostic EEG Representations with GeometryAware Modeling (yi…dongsheng li, 2023)
 Strong Prediction: Language Model Surprisal Explains Multiple N400 Effects (michaelov…coulson, 2024)
fMRI
nlp
 Mapping Brains with Language Models: A Survey (Karamolegkou et al. 2023)
 interpreting brain encoding models
 Brains and algorithms partially converge in natural language processing (caucheteux & king, 2022)
 best brainmapping are obtained from the middle layers of DL models
 whether an algorithm maps onto the brain primarily depends on its ability to predict words from context
 average ROIs across many subjects
 test “compositionality” of features
 Tracking the online construction of linguistic meaning through negation (zuanazzi, …, remi king, poeppel, 2022)
 Blackbox meets blackbox: Representational Similarity and Stability Analysis of Neural Language Models and Brains (abnar, … zuidema, emnlp workshop, 2019)  use RSA to compare representations from language models with fMRI data from Wehbe et al. 2014
 Evidence of a predictive coding hierarchy in the human brain listening to speech (caucheteux, gramfot, & king, 2023)

encoding models
 Seminal languagesemantics fMRI study (Huth…Gallant, 2016)  build mapping of semantic concepts across cortex using word vecs
 (caucheteux, gramfort, & king, facebook, 2022)  predicts fMRI with gpt2 on the narratives dataset
 GPT‐2 representations predict fMRI response + extent to which subjects understand corresponding narratives
 compared different encoding features: phoneme, word, gpt2 layers, gpt2 attention sizes
 brain mapping finding: auditory cortices integrate information over short time windows, and the frontoparietal areas combine supralexical information over long time windows
 gpt2 models predict brain responses well (caucheteux & king, 2021)
 Disentangling syntax and semantics in the brain with deep networks (caucheteux, gramfort, & king, 2021)  identify which brain networks are involved in syntax, semantics, compositionality
 Incorporating Context into Language Encoding Models for fMRI (jain & huth, 2018)  LSTMs improve encoding model
 The neural architecture of language: Integrative modeling converges on predictive processing (schrimpf, .., tenenbaum, fedorenko, 2021)  transformers better predict brain responses to natural language (and larger transformers predict better)

[Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data](https://direct.mit.edu/nol/article/doi/10.1162/nol_a_00087/113632/PredictiveCodingorJustFeatureDiscoveryAn) (antonello & huth, 2022, Neurobiology of Language) - LLM brain-encoding performance correlates not only with their perplexity, but also generality (skill at many different tasks) and translation performance
 Prediction with RNN beats ngram models on individualsentence fMRI prediction (anderson…lalor, 2021)
 Interpret transformerbased models and find top predictions in specific regions, like left middle temporal gyrus (LMTG) and left occipital complex (LOC) (sun et al. 2021)
 LexicalSemantic Content, Not Syntactic Structure, Is the Main Contributor to ANNBrain Similarity of fMRI Responses in the Language Network (kauf…andreas, fedorenko, 2024)  lexical semantic sentence content, not syntax, drive alignment.
 Artificial Neural Network Language Models Predict Human Brain Responses to Language Even After a Developmentally Realistic Amount of Training (hosseini…fedorenko, 2024)  models trained on a developmentally plausible amount of data (100M tokens) already align closely with human benchmarks
 changing experimental design
 Semantic representations during language comprehension are affected by context (i.e. how language is presented) (deniz…gallant, 2021) - stimuli with more context (stories, sentences) evoke better responses than stimuli with little context (Semantic Blocks, Single Words)
 Combining computational controls with natural text reveals new aspects of meaning composition (toneva, mitchell, & wehbe, 2022)  study word interactions by using encoding vector emb(phrase)  emb(word1)  emb(word2)…
 Driving and suppressing the human language network using large language models (tuckute, …, schrimpf, kay, & fedorenko, 2023)
 use encoding models to sort thousands of sentences and then show them
 alternatively, use gradientbased modifications to transform a random sentence to elicit larger responses, but this works worse
 surprisal and well formedness of linguistic input are key determinants of response strength in the language network
 decoding models
 Semantic reconstruction of continuous language from noninvasive brain recordings (lebel, jain, & huth, 2022)  reconstruct continuous natural language from fMRI
 Decoding speech from noninvasive brain recordings (defossez, caucheteux, …, remi king, 2022)
 The generalizability crisis (yarkoni, 2020)  there is widespread difficulty in converting informal verbal hypotheses into quantitative models
vision
 Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies (nishimoto, …, gallant, 2011)
 Brain Decoding: Toward Real Time Reconstruction of Visual Perception (Benchetrit…king, 2023)  use MEG to do visual reconstruction
 Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding (chen et al. 2022)
 Aligning brain functions boosts the decoding of visual semantics in novel subjects (thual…king, 2023)  align across subjects before doing decoding
 A variational autoencoder provides novel, datadriven features that explain functional brain representations in a naturalistic navigation task (cho, zhang, & gallant, 2023)
advanced topics
highdimensional (hyperdimensional) computing
computing with random high-dim vectors (also known as vector-symbolic architectures)
Good overview website: https://www.hdcomputing.com
 ovw talk (kanerva, 2022)
 has slide with related references
 A comparison of vector symbolic architectures (schlegel et al. 2021)
motivation
 highlevel overview
 draw inspiration from circuits not single neurons
 the brain’s circuits are highdimensional
 elements are stochastic not deterministic
 no 2 brains are alike yet they exhibit the same behavior
 basic question of comp neuro: what kind of computing can explain behavior produced by spike trains?
 recognizing ppl by how they look, sound, or behave
 learning from examples
 remembering things going back to childhood
 communicating with language
operations
 ex. vectors $A$, $B$ both $\in \{ +1, -1\}^{10,000}$ (also extends to real / complex vectors)
 3 operations
 addition: A + B = (0, 0, 2, 0, -2, -2, 0, ….)
 alternatively, could take mean
 multiplication: A * B = (±1, ±1, …) - componentwise product - this is XOR (sketch after this list)
 want this to be invertible, distribute over addition, preserve distance, and be dissimilar to the vectors being multiplied
 the number of -1s after multiplication is the distance between the two original vectors
 can represent a dissimilar set vector by using multiplication
 permutation: shuffles values (like bitshift)
 ex. rotate (bit shift with wrapping around)
 multiply by rotation matrix (where each row and col contain exactly one 1)
 can think of permutation as a list of numbers 1, 2, …, n in permuted order
 many properties similar to multiplication
 random permutation randomizes
 secondary operations
 weighting by a scalar
 similarity = dot product (sometimes normalized)
 A $\cdot$ A = 10k
 A $\cdot$ A' = 0 for a different random vector A' (orthogonal)
 in highdim spaces, almost all pairs of vectors are dissimilar A $\cdot$ B = 0
 goal: similar meanings should have large similarity
 normalization
 for binary vectors, just take the sign
 for nonbinary vectors, scalar weight
 fractional binding  can bind different amounts rather than binary similar / dissimilar
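A sketch of the three operations and the similarity measure above, on random ±1 vectors (the "dollar of mexico" query from the examples further down is used as a check):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000

def rand_vec():
    return rng.choice([-1, 1], size=D)

def bundle(*vs):                  # addition, squashed back to +/-1 (random tie-break)
    return np.sign(np.sum(vs, axis=0) + 0.1 * rand_vec())

def bind(a, b):                   # componentwise multiplication (XOR for +/-1 vectors)
    return a * b

def sim(a, b):                    # normalized dot product
    return a @ b / D

NAME, MONEY, USA, MEX, DOLLAR, PESO = (rand_vec() for _ in range(6))
US = bundle(bind(NAME, USA), bind(MONEY, DOLLAR))
MEXICO = bundle(bind(NAME, MEX), bind(MONEY, PESO))

query = bind(bind(DOLLAR, US), MEXICO)      # "dollar of mexico?"
vocab = {"USA": USA, "MEX": MEX, "DOLLAR": DOLLAR, "PESO": PESO}
print(max(vocab, key=lambda k: sim(vocab[k], query)))   # PESO
```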
data structures
the operations above allow for encoding many normal data structures into a single vector
 set  can be represented with a sum (since the sum is similar to all the vectors)
 can find a stored set using any element
 if we don’t store the sum, can probe with the sum and keep subtracting the vectors we find
 multiset = bag (set with frequency counts)  can store frequencies by adding an element multiple times, but hard to actually retrieve the counts
 sequence  could have each element be an address pointing to the next element
 problem  hard to represent sequences that share a subsequence (could have pointers which skip over the subsequence)
 soln: index elements based on permuted sums
 can look up an element based on previous element or previous string of elements  could do some kind of weighting also
 pairs  could just multiply (XOR), but then get some weird things, e.g. A * A = 0
 instead, permute then multiply
 can use these to index (address, value) pairs and make more complex data structures
 named tuples  have something like (name: x, date: m, age: y) and store as holistic vector $H = N*X + D*M + A*Y$
 an individual attribute value can be retrieved by unbinding with the vector for its key (see the sketch after this list)
 substituting one representation for another is a little trickier…
 we blur what is a value and what is a variable
 can do this for a pair or for a named tuple with new values
 this doesn’t always work
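A minimal sketch of the named-tuple encoding above, reusing the bipolar bind (elementwise product) and bundle (sum) operations from the earlier sketch; the role and filler names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000

def V():
    """Random bipolar hypervector."""
    return rng.choice([-1, 1], size=D)

def sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# role (key) vectors and filler (value) vectors
NAME, DATE, AGE = V(), V(), V()
X, M, Y = V(), V(), V()

# holistic record: H = NAME*X + DATE*M + AGE*Y
H = NAME * X + DATE * M + AGE * Y

# retrieve the value stored under AGE: unbind with the key, then "clean up"
probe = AGE * H                    # = Y + cross terms that behave like noise
fillers = {"X": X, "M": M, "Y": Y}
print(max(fillers, key=lambda k: sim(probe, fillers[k])))   # -> "Y"
```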
examples
 ex. semantic word vectors
 goal: get good semantic vectors for words
 baseline (e.g. latent semantic analysis, LSA): make a matrix of word counts, where each row is a word and each column is a document
 add counts to each column – row vector becomes semantic vector
 HD computing alternative: each row is a word, but each document is assigned a few ~10 columns at random
 the number of columns doesn’t scale with the number of documents
 can also do this randomness for the rows (so the number of rows < the number of words)
 can still get semantic vector for a row/column by adding together the rows/columns which are activated by that row/column
 ex. semantic word vectors 2 (like word2vec)
 each word in vocab is given 2 vectors
 randomindexing vector  fixed random from the beginning
 semantic vector  starts at 0
 as we traverse sequence, for each word, add randomindexing vector from words right before/after it to its semantic vector
 can also permute them before adding to preserve word order (e.g. permutations as a means to encode order in word space (kanerva, 2008))
 can instead use placeholder vector to help bring in word order (e.g. BEAGLE  Jones & Mewhort, 2007)
 ex. learning rules by example
 a particular instance of a rule is itself a rule (e.g. mother-son-baby $\to$ grandmother)
 as we get more examples and average them, the rule gets better
 doesn’t always work (especially when things collapse to identity rule)
 ex. what is the dollar of mexico? (kanerva, 2010)
 initialize US = (NAME * USA) + (MONEY * DOLLAR)
 initialize MEXICO = (NAME * MEXICO) + (MONEY * PESO)
 query: “Dollar of Mexico”? = DOLLAR * US * MEXICO = PESO
 ex. text classification (najafabadi et al. 2016)
 ex. language classification  “Language Recognition using Random Indexing” (joshi et al. 2015)
 scalable, easily uses any-order n-grams (see the code sketch after this list)
 data
 train: given a million bytes of text per language (in the same alphabet)
 test: new sentences for each language
 training: compute a 10k profile vector for each language and for each test sentence
 could encode each letter with a random 10k-dim seed vector
 instead encode trigrams with rotate and multiply
 1st letter vec rotated by 2 * 2nd letter vec rotated by 1 * 3rd letter vec
 ex. THE = r(r(T)) * r(H) * r(E)
 approximately orthogonal to all the letter vectors and all the other possible trigram vectors…
 profile = sum of all trigram vectors (taken sliding)
 ex. banana = ban + ana + nan + ana
 profile is like a histogram of trigrams
 testing
 compare each test sentence to profiles via dot product
 clusters similar languages
 can query the letter most likely to follow “TH”
 form query vector $Q = r(r(T)) * r(H)$
 query by multiplying: $X = Q *$ english-profile-vec
 find closest letter vecs to $X$: yields “e”
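A toy sketch of this language-recognition scheme (rotate-and-multiply trigram encoding, profiles as sums of sliding trigram vectors, dot-product comparison). It uses a 27-letter toy alphabet including space, and the miniature "corpora" stand in for the million-byte training texts, so treat the output only as an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000
letters = "abcdefghijklmnopqrstuvwxyz "           # toy alphabet including space
letter_vec = {c: rng.choice([-1, 1], size=D) for c in letters}

def trigram_vec(a, b, c):
    """Rotate-and-multiply trigram encoding: r(r(a)) * r(b) * c."""
    return np.roll(letter_vec[a], 2) * np.roll(letter_vec[b], 1) * letter_vec[c]

def profile(text):
    """Language profile = sum of all sliding trigram vectors (a trigram 'histogram')."""
    text = "".join(ch for ch in text.lower() if ch in letters)
    return sum(trigram_vec(*text[i:i + 3]) for i in range(len(text) - 2))

def sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

eng = profile("the quick brown fox jumps over the lazy dog and the cat sat on the mat")
spa = profile("el rapido zorro marron salta sobre el perro perezoso y el gato")
test = profile("the dog and the fox ran over the hill")
print(sim(test, eng), sim(test, spa))   # the test sentence should match the english profile

# query: which letter most likely follows "th" in the english profile?
Q = np.roll(letter_vec["t"], 2) * np.roll(letter_vec["h"], 1)
X = Q * eng                             # unbinding leaves the third letters of "th_" trigrams
print(max(letters, key=lambda c: X @ letter_vec[c]))   # likely "e"
```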
details
 frequent “stopwords” should be ignored
 mathematical background
 randomly chosen vecs are dissimilar
 sum vector is similar to its argument vectors
 product vector and permuted vector are dissimilar to their argument vectors
 multiplication distributes over addition
 permutation distributes over both addition and multiplication
 multiplication and permutations are invertible
 addition is approximately invertible
 comparison to DNNs
 both do statistical learning from data
 data can be noisy
 both use high-dim vecs, although DNNs degrade at very high dims (e.g. 100k)
 new codewords are made from existing ones
 HD memory is a separate func
 HD algos are transparent, incremental (online), scalable
 somewhat closer to the brain…cerebellum anatomy seems to better match HD
 HD: holistic (distributed repr.) is robust
HD papers
HDComputing Github Repos (see torchhd)
 HD computing overview paper (Kanerva, 2009)
 in these high dimensions, most points are close to equidistant from one another (L1 distance), and are approximately orthogonal (dot product is 0)
 memory
 heteroassociative  can return stored X based on its address A
 autoassociative  can return stored X based on a noisy version of X (since it is a point attractor), maybe with some iteration
 this adds robustness to the memory
 this also removes the need for addresses altogether
 Classification and Recall With Binary Hyperdimensional Computing: Tradeoffs in Choice of Density and Mapping Characteristics (kleyko et al. 2018)
 note: for sparse vectors, might need some threshold before computing mean (otherwise will have too many zeros)
 Neural Statistician (Edwards & Storkey, 2016) summarises a dataset by averaging over their embeddings
 kanerva machine (yu…lillicrap, 2018)
 like a VAE where the prior is derived from an adaptive memory store
 theory of sequence indexing and working memory in RNNs
 trying to make keyvalue pairs
 VSA as a structured approach for understanding neural networks
 reservoir computing = state-dependent network = echo-state network = liquid state machine  tries to represent sequential temporal data  builds representations on the fly
 different names
 Tony plate: holographic reduced representation (1995)
 related to TPR by paul smolensky
 ross gayler: multiplyaddpermute arch
 gayler & levi: vectorsymbolic arch
 gallant & okaywe: matrix binding with additive terms
 fourier holographic reduced representations (FHRR; Plate)
 …many more names
 connecting to DNNs
 Attention Approximates Sparse Distributed Memory https://arxiv.org/pdf/2111.05498.pdf
 The Kanerva Machine: A Generative Distributed Memory (wu…lillicrap, 2018)
dynamic routing between capsules
 hinton 1981  reference frames require structured representations
 mapping units vote for different orientations, sizes, positions based on basic units
 mapping units gate the activity from other types of units  weight is dependent on if mapping is activated
 topdown activations give info back to mapping units
 this is a hopfield net with threeway connections (between input units, output units, mapping units)
 reference frame is a key part of how we see  need to vote for transformations
 olshausen, anderson, & van essen 1993  dynamic routing circuits
 ran simulations of such things (hinton said it was hard to get simulations to work)
 learn things in objectbased reference frames
 inputs > outputs has weight matrix gated by control
 zeiler & fergus 2013  visualizing things at intermediate layers  deconv (by dynamic routing)
 save indexes of max pooling (these would be the control neurons)
 when you do deconv, assign max value to these indexes
 arathorn 2002  map-seeking circuits
 tenenbaum & freeman 2000  bilinear models
 trying to separate content + style
 hinton et al 2011  transforming autoencoders  trained a neural net to learn to shift an image
 sabour et al 2017  dynamic routing between capsules
 units output a vector (represents info about reference frame)
 matrix transforms reference frames between units
 recurrent control units settle on some transformation to identify reference frame
 notes from this blog post
 problems with cnns
 pooling loses info
 don’t account for spatial relations between image parts
 can’t transfer info to new viewpoints
 capsule  vector specifying the features of an object (e.g. position, size, orientation, hue, texture) and its likelihood
 ex. an “eye” capsule could specify the probability it exists, its position, and its size
 magnitude (i.e. length) of vector represents probability it exists (e.g. there is an eye)
 direction of vector represents the instantiation parameters (e.g. position, size)
 hierarchy
 capsules in later layers are functions of the capsules in lower layers, and since capsule has extra properties can ask questions like “are both eyes similarly sized?”
 equivariance = capsule outputs change in a predictable way (same amount/direction) as the viewpoint is rotated or transformed, which lets the net generalize across viewpoints
 active capsules at one level make predictions for the instantiation parameters of higherlevel capsules
 when multiple predictions agree, a higherlevel capsule is activated
 steps in a capsule (e.g. one that recognizes faces)
 receives an input vector (e.g. representing eye)
 apply affine transformation  encodes spatial relationships (e.g. between eye and where the face should be)
 apply a weighted sum using the coupling (C) weights, determined by the routing-by-agreement algorithm
 these weights are learned to group similar outputs to make higherlevel capsules
 vectors are squashed so their magnitudes are between 0 and 1 (see the sketch below)
 outputs a vector
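A minimal sketch of the squashing nonlinearity from Sabour et al. 2017, which keeps a capsule vector's direction but maps its length into [0, 1) so that the length can act as a probability:

```python
import numpy as np

def squash(s, eps=1e-9):
    """Capsule squashing nonlinearity (Sabour et al. 2017):
    v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    norm = np.sqrt(norm_sq + eps)
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

short = squash(np.array([0.1, 0.0]))   # short input  -> length stays small (~0.01)
long_ = squash(np.array([10.0, 0.0]))  # long input   -> length saturates near 1
print(np.linalg.norm(short), np.linalg.norm(long_))
```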
hierarchical temporal memory (htm, numenta)
 binary synapses and learns by modeling the growth of new synapses and the decay of unused synapses
 separates aspects of brains and neurons that are essential for intelligence from those that depend on brain implementation
 terminology changed from HTM to Thousand Brains Theory
neocortical structure
 evolution yields physical/logical hierarchy of brain regions
 neocortex is like a flat sheet
 neocortex regions are similar and do similar computation
 Mountcastle 1978: vision regions are vision because they receive visual input
 number of regions / connectivity seems to be genetic
 before neocortex, brain regions were homogeneous: spinal cord, brain stem, basal ganglia, …
principles
 common algorithms across neocortex
 hierarchy
 sparse distributed representations (SDR)  vectors with thousands of bits, mostly 0s
 bits of representation encode semantic properties
 ex. k-winner-take-all layer (needs a boosting term to favor units that are activated less frequently, so that a few neurons don’t take all the info; see the sketch after this list)
 inputs
 data from the senses
 copy of the motor commands
 “sensorymotor” integration  perception is stable while the eyes move
 patterns are constantly changing
 neocortex tries to control old brain regions which control muscles
 learning: region accepts stream of sensory data + motor commands
 learns of changes in inputs
 outputs motor commands
 only knows how its output changes its input
 must learn how to control behavior via associative linking
 sensory encoders  take input and turn it into an SDR
 engineered systems can use nonhuman senses
 behavior needs to be incorporated fully
 temporal memory  is a memory of sequences
 everything the neocortex does is based on memory and recall of sequences of patterns
 online learning
 prediction is compared to what actually happens and forms the basis of learning
 minimize the error of predictions
 performance works better on chips like FPGAs compared to GPUs
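A minimal sketch of a k-winner-take-all layer with a boosting term of the kind mentioned above; the exponential-boost form and parameter names are assumptions, loosely following sparse-layer implementations in the Numenta literature:

```python
import numpy as np

def k_winners(x, duty_cycle, k, boost_strength=1.0):
    """k-winner-take-all with boosting: units that have fired less often
    (low duty cycle) get boosted so that a few units don't dominate the sparse code."""
    target_density = k / x.size
    boost = np.exp(boost_strength * (target_density - duty_cycle))
    winners = np.argsort(x * boost)[-k:]   # indices of the k largest boosted activations
    out = np.zeros_like(x)
    out[winners] = x[winners]              # keep the original activations of the winners
    return out, winners

rng = np.random.default_rng(0)
n, k = 100, 5
duty = np.zeros(n)
for t in range(200):
    x = rng.random(n)
    out, winners = k_winners(x, duty, k)
    active = np.zeros(n)
    active[winners] = 1
    duty = 0.99 * duty + 0.01 * active     # running estimate of how often each unit wins

print(duty.min(), duty.max())              # boosting keeps win rates roughly balanced
```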
papers
 A thousand brains: toward biologically constrained AI (hole & ahmad, 2021)
 A Theory of How Columns in the Neocortex Enable Learning the Structure of the World (hawkins et al. 2017)
 single column  integrate touches over time  represent objects properly
 multiple columns  integrate spatial inputs  make things fast
 like different sensory patches (e.g. different fingers or different rods on the retina) are all simultaneously voting
 network model that learns the structure of objects through movement
 object recognition
 over time individual columns integrate changing inputs to recognize complete objects
 through existing lateral connections
 within each column, neocortex is calculating a location representation
 locations relative to each other = allocentric
 much more motion involved
 “Why Neurons Have Thousands of Synapses, A Theory of Sequence Memory in Neocortex”
 learning and recalling sequences of patterns
 neuron with lots of synapses can learn transitions of patterns
 network of these can form robust memory
 How Can We Be So Dense? The Benefits of Using Highly Sparse Representations (ahmad & scheinkman, 2019)
 highdim sparse representations are more robust
 info content of sparse vectors increases with dimensionality, without introducing additional nonzeros
forgetting
 Continual Lifelong Learning with Neural Networks: A Review (parisi…kanan, wermter, 2019)
 main issue is catastrophic forgetting / the stability-plasticity dilemma
 2 types of plasticity
 Hebbian plasticity (Hebb 1949) for positive feedback instability
 compensatory homeostatic plasticity which stabilizes neural activity
 approaches: regularization, dynamic architectures (e.g. add more nodes after each task), memory replay
maximal exciting inputs
 singleneuron (macaque)

Neural population control via deep image synthesis (bashivan, kar, & dicarlo, 2019)

Energy Guided Diffusion for Generating Neurally Exciting Images (pierzchlewicz, …, tolias, sinz, 2023)

Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences (ponce…livingstone, 2019)

The DeepTune framework for modeling and characterizing neurons in visual cortex area V4 (abbasiasl, …, yu, 2018)

Compact deep neural network models of visual cortex (cowley, stan, pillow, & smith, 2023)

 XDream: Finding preferred stimuli for visual neurons using generative networks and gradientfree optimization (2020)

CORNN: Convex optimization of recurrent neural networks for rapid inference of neural dynamics (dinc…tanaka, 2023)  mouse population control
 realtime mouse v1
 Inception in visual cortex: in vivosilico loops reveal most exciting images (2018)
 Adept: Adaptive stimulus selection for optimizing neural population responses (cowley…byron yu, 2017)
 select next image using kernel regression from CNN embeddings
 pick next stimulus in closedloop (“adaptive sampling” = “optimal experimental design”) for macaque v4
 From response to stimulus: adaptive sampling in sensory physiology (2007)

find the smallest number of stimuli needed to fit parameters of a model that predicts the recorded neuron’s activity from the stimulus

maximizing firing rates via genetic algorithms
 maximizing firing rate via gradient ascent
 Adaptive stimulus optimization for sensory systems neuroscience
 2 general approaches: gradientbased approaches + genetic algorithms
 can put constraints on stimulus space
 stimulus adaptation
 might want isoresponse surfaces
 maximally informative stimulus ensembles (Machens, 2002)
 modelfitting: pick to maximize infogain w/ model params
 using fixed stimulus sets like white noise may be deeply problematic for efforts to identify nonlinear hierarchical network models due to continuous parameter confounding (DiMattina and Zhang, 2010)
 use for model selection
 Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models (luo…wehbe, tarr, 2023)
population coding
 saxena_19_pop_cunningham: “Towards the neural population doctrine”
 correlated trialtotrial variability
 Ni et al. showed that the correlated variability in V4 neurons during attention and learning — processes that have inherently different timescales — robustly decreases
 ‘choice’ decoder built on neural activity in the first PC performs as well as one built on the full dataset, suggesting that the relationship of neural variability to behavior lies in a relatively small subspace of the state space.
 decoding
 more neurons only helps if neuron doesn’t lie in span of previous neurons
 encoding
 can train dnn goaldriven or train dnn on the neural responses directly
 testing
 important to be able to test population structure directly
 population vector coding  ex. direction-tuned neurons: sum each neuron’s preferred direction weighted by its firing to get the final direction (see the sketch at the end of this section)
 reduces uncertainty
 correlation coding  correlations between spikes carry extra info
 independentspike coding  each spike is independent of other spikes within the spike train
 position coding  want to represent a position
 for grid cells, very efficient
 sparse coding
 hard when noise between neurons is correlated
 measures of information
 eda
 plot neuron responses
 calc neuron covariances
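A toy sketch of population vector decoding for direction; the cosine tuning curves and Poisson spike counts are illustrative assumptions, not fit to data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
preferred = rng.uniform(0, 2 * np.pi, n)          # each neuron's preferred direction

def responses(theta):
    """Cosine tuning: firing rate peaks at the preferred direction; Poisson spike counts."""
    rates = 10 * (1 + np.cos(theta - preferred))
    return rng.poisson(rates)

def population_vector(counts):
    """Sum each neuron's preferred-direction unit vector, weighted by its response."""
    x = np.sum(counts * np.cos(preferred))
    y = np.sum(counts * np.sin(preferred))
    return np.arctan2(y, x) % (2 * np.pi)

true_theta = 1.0
print(true_theta, population_vector(responses(true_theta)))  # estimate should be close to 1.0
```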
interesting misc papers
 berardino 17 eigendistortions
 Fisher info matrix under certain assumptions = $J^T J$ (pixels x pixels), where $J$ is the Jacobian of the model response function $f$ with respect to the pixels $x$ (see the sketch at the end of this list)
 most and least noticeable distortion directions correspond to the eigenvectors of the Fisher info matrix
 gao_19_v1_repr
 don’t learn from images  v1 repr should come from motion like it does in the real world
 repr
 vector of local content
 matrix of local displacement
 why is this repr nice?
 separate reps of static image content and change due to motion
 disentangled rotations
 learning
 predict next image given current image + displacement field
 predict next image vector given current frame vectors + displacement
 friston_10_free_energy
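A minimal sketch of the eigendistortion idea from the berardino item above, using a toy differentiable response model and a numerical Jacobian; the model here is a stand-in, not one from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_units = 32, 64
W = rng.normal(size=(n_units, n_pix)) / np.sqrt(n_pix)

def model(x):
    """Toy response model f(x): sigmoid outputs of random linear filters."""
    return 1 / (1 + np.exp(-W @ x))

x0 = rng.random(n_pix)   # the base "image" (flattened pixels)

# numerical Jacobian J (n_units x n_pix) of the responses w.r.t. the pixels
eps = 1e-5
J = np.stack(
    [(model(x0 + eps * np.eye(n_pix)[i]) - model(x0 - eps * np.eye(n_pix)[i])) / (2 * eps)
     for i in range(n_pix)],
    axis=1,
)

F = J.T @ J                       # Fisher info metric (pixels x pixels), under Gaussian-noise assumptions
evals, evecs = np.linalg.eigh(F)  # ascending eigenvalues
most_noticeable = evecs[:, -1]    # distortion direction the model responds to most
least_noticeable = evecs[:, 0]    # distortion direction the model responds to least
print(evals[-1] / evals[0])       # ratio of model sensitivity along the two directions
```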
navigation
 cognitive maps (tolman 1940s)  idea that rats in mazes learn spatial maps
 place cells (o’keefe 1971)  in the hippocampus  fire to indicate one’s current location
 remap to new locations
 grid cells (moser & moser 2005)  in the entorhinal cortex (provides inputs to the hippocampus)  code not particular locations but rather a hexagonal coordinate system
 grid cells fire if the mouse is in any location at the vertex (or center) of one of the hexagons
 there are grid cells with larger/smaller hexagons, different orientations, different offsets
 can look for grid cells signature in fmri: https://www.nature.com/articles/nature08704
 other places with grid celllike behavior
 eye movement task
 some evidence for grid cells in cortex
 Constantinescu, A., O’Reilly, J., Behrens, T. (2016) Organizing Conceptual Knowledge in Humans with a Gridlike Code. Science
 Doeller, C. F., Barry, C., \& Burgess, N. (2010). Evidence for grid cells in a human memory network. Nature
 some evidence for “time cells” like place cells for time
 sound frequency task https://www.nature.com/articles/nature21692
 2d “bird space” task
neuromorphic computing
 neuromorphic examples
 differential pair yields a sigmoid-like transfer function (see the equation at the end of this section)
 can compute a $\tanh$ nonlinearity very cheaply in analog hardware, useful for simulating neurons
 silicon retina
 lateral inhibition exists (gap junctions in horizontal cells)
 analog VLSI retina: centersurround receptive field is very low energy
 mead & mahowald 1989
 computation requires energy (otherwise signals would dissipate)
 von neumann architecture: CPU  bus (data / address)  Memory
 moore’s law ending (in terms of cost, clock speed, etc.)
 ex. errors increase as device size decreases (and can’t tolerate any errors)
 neuromorphic computing
 brain ~ 20 Watts
 exploit intrinsic transistor physics (need extremely small amounts of current)
 exploit electronics laws: kirchhoff’s law, ohm’s law
 new materials (ex. memristor  3d crossbar array)
 can’t just do biological mimicry  need to understand the principles
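For reference, the standard subthreshold differential-pair transfer function (as derived in Mead's Analog VLSI and Neural Systems), with bias current $I_b$, thermal voltage $U_T$, and subthreshold slope factor $\kappa$; the exact form depends on the transistor model assumed:

$$I_1 - I_2 = I_b \tanh\!\left(\frac{\kappa (V_1 - V_2)}{2 U_T}\right)$$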
locality sensitive hashing
 localitysensitive hashing is a fuzzy hashing technique that hashes similar input items into the same “buckets” with high probability
 hash collisions are maximized, rather than minimized as they are in dictionaries
 finding embeddings via DNNs is a special case of this (e.g. might call it “semantic hashing”)
 random projections in the brain effectively perform locality-sensitive hashing (basically approximate nearest neighbors; see the sketch below)
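A minimal SimHash-style sketch of locality-sensitive hashing with random projections; the dimensions and noise level are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 128, 16
planes = rng.normal(size=(n_bits, d))     # random hyperplanes

def lsh_hash(x):
    """SimHash-style LSH: the sign pattern of random projections.
    Nearby vectors (small angle) tend to land in the same bucket."""
    return tuple((planes @ x > 0).astype(int))

x = rng.normal(size=d)
x_noisy = x + 0.05 * rng.normal(size=d)   # a slightly perturbed copy
y = rng.normal(size=d)                    # an unrelated vector

print(sum(a != b for a, b in zip(lsh_hash(x), lsh_hash(x_noisy))))  # few differing bits
print(sum(a != b for a, b in zip(lsh_hash(x), lsh_hash(y))))        # ~half the bits differ
```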
encoding vs decoding
 duality between encoding and decoding (e.g. for probing smth like syntax in LLM)
 esp. when things are localized like in fMRI
 Interpreting encoding and decoding models (kriegeskorte & douglas, 2019)
 Encoding and decoding in fMRI (naselaris, kay, nishimoto, & gallant, 2011)
 Causal interpretation rules for encoding and decoding models in neuroimaging (weichwald…grossewentrup, 2015)
neuroinspired ai (niAI)
neurodl reviews

Blog post (xcorr, 2023)

NeuroscienceInspired Artificial Intelligence (hassabis et al. 2017)

Catalyzing nextgeneration Artificial Intelligence through NeuroAI (zador, …bengio, dicarlo, lecun, …sejnowski, tsao, 2022)

Computational language modeling and the promise of in silico experimentation (jain, vo, wehbe, & huth, 2023)  4 experimental design examples
 compare concrete & abstract words (binder et al. 2005)
 contrastbased study of composition in 2word phrase (Bemis & Pylkkanen, 2011)
 checks for effects between group and individual (lerner et al. 2011)
 forgetting behavior using controlled manipulations (chien & honey, 2020)

Dissociating language and thought in large language models (mahowald, …, tenenbaum, fedorenko, 2023)

2 competences

formal linguistic competence  knowledge of rules and patterns of a given language

functional linguistic competence  cognitive abilities required for language understanding and use in the real world, e.g. formal reasoning, world knowledge, situation modeling, communicative intent
 much of world knowledge is implied: people are much more likely to communicate new or unusual information rather than commonly known facts
 language and thought are robustly dissociable in the brain
 aphasia studies: despite the nearly complete loss of linguistic abilities, some individuals with severe aphasia have intact nonlinguistic cognitive abilities: they can play chess, compose music, solve arithmetic problems and logic puzzles, …
 fMRI studies: the language network is extremely selective for language processing: it responds robustly and reliably when people listen to, read, or generate sentences , but not when they perform arithmetic tasks, engage in logical reasoning, understand computer programs, listen to music, …

 Merrill et al. [2022]  semantic information is in principle learnable from language data

Piantadosi and Hill [2022]  an argument that models can genuinely learn meaning

Structured, flexible, and robust: benchmarking and improving large language models towards more humanlike behavior in out-of-distribution reasoning tasks (collins…tenenbaum, 2022)
 as situation gets more OOD, LLM gets worse compared to human, e.g.
Get your sofa onto the roof of your house, without using a pulley, a ladder, a crane...

recommendations
 modularity, curated data / diverse objectives, new benchmarks


Neurocompositional computing: From the Central Paradox of Cognition to a new generation of AI systems (smolensky, …, gao, 2022)

A Path Towards Autonomous Machine Intelligence (lecun 2022)

Towards NeuroAI: Introducing Neuronal Diversity into Artificial Neural Networks (2023)

A rubric for humanlike agents and NeuroAI (momennejad, 2022): 3 axes  humanlike behavior, neural plausibility, & engineering

Designing Ecosystems of Intelligence from First Principles (friston et al. 2022)

NeuroAI  A strategic opportunity for Norway and Europe (2022)

Perceptual Inference, Learning, and Attention in a Multisensory World (noppeney, 2021)

Engineering a Less Artificial Intelligence (sinz…tolias, 2019)  overview of ideas to make DNNs more brainlike
biological constraints for DNNs

Aligning DNN with brain responses
 haven’t found anything like this for NLP
 Aligning Model and Macaque Inferior Temporal Cortex Representations Improves Model-to-Human Behavioral Alignment and Adversarial Robustness (dapello, kar, schrimpf…cox, dicarlo, 2022)  finetune CNN embedding to match monkey brain (IT electrode recordings) before making classifications
 Towards robust vision by multitask learning on monkey visual cortex (safarani…sinz, 2021)  simultaneously predict monkey v1 (electrode data) and imagenet
 Improved object recognition using neural networks trained to mimic the brain’s statistical properties  ScienceDirect (federer et al. 2020)  simultaneously train CNN to classify objects + have similar reprs to monkey electrode data
 Learning from brains how to regularize machines (li …, tolias 2019)  regularize intermediate representations using mouse v1 data (optical imaging) for image classification
 A Neurobiological Evaluation Metric for Neural Network Model Search (blanchard, …, bashivan, scheirer, 2019)  compare fMRI kernel matrix to DNN kernel matrix  find that the closer it is, the better a network is (and use this metric to perform early stopping)
 aligning with experimental/psychological data

[How Well Do Unsupervised Learning Algorithms Model Human Realtime and Lifelong Learning? OpenReview](https://openreview.net/forum?id=c0l2YolqD2T) (zhuang…dicarlo, yamins, 2022)


Biologicallyinspired DNNs (not datadriven)
 Relating transformers to models and neural representations of the hippocampal formation (whittington, warren, & behrens, 2022)
 transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation, most notably place and grid cells
 Emergence of foveal image sampling from learning to attend in visual scenes (cheung, weiss, & olshausen, 2017)  using neural attention model, learn a retinal sampling lattice
 can figure out what parts of the input the model focuses on
 Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations (dapello…cox, dicarlo, 2020)  biologically inspired early neuralnetwork layers (gabors etc.) improve robustness of CNNs
 BrainLike Object Recognition with HighPerforming Shallow Recurrent ANNs (kubilius, schrimpf, kar, …, yamins, dicarlo, 2019)
 Combining Different V1 Brain Model Variants to Improve Robustness to Image Corruptions in CNNs (baidya, dapello, dicarlo, & marques, 2021)
 Surround Modulation: A Bioinspired Connectivity Structure for Convolutional Neural Networks (hasani, …, aghajan, 2019)  add inhibitory lateral connections in CNNs
 Biologically inspired protection of deep networks from adversarial attacks (nayebi & ganguli, 2017)  change training to get highly nonlinear, saturated neural nets
 Biological constraints on neural network models of cognitive function (pulvermuller, …, wennekers, 2021)  review on biological constraints
 Disentangling with Biological Constraints: A Theory of Functional Cell Types
overview
explaining concepts from neuroscience to inform deep learning
The brain currently outperforms deep learning in a number of different ways: efficiency, parallel computation, not forgetting, robustness. Thus, in these areas and others, the brain can offer highlevel inspiration as well as more detailed algorithmic ideas on how to solve complex problems.
We begin with some history and perspective before further exploring these concepts at 3 levels: (1) the neuron level, (2) the network level, and (3) highlevel concepts.
brief history
The history of deep learning is intimately linked with neuroscience. In vision, the idea of hierarchical processing dates back to Hubel and Wiesel
Approaches range from loosely neurally inspired to strictly biologically plausible
Computational neuroscientists often discuss understanding computation at Marr’s 3 levels of understanding: (1) computational, (2) algorithmic, and (3) mechanistic
cautionary notes
There are dangers in deep learning researchers constraining themselves to biologically plausible algorithms. First, the underlying hardware of the brain and modern von Neumann-based architectures is drastically different, and one should not assume that the same algorithms will work on both systems. Several examples, such as backpropagation, were derived by deviating from the mindset of mimicking biology.
Second, the brain does not solve every problem well, so there are dangers in going too far: one wouldn’t want to draw inspiration from the retina and put a hole in the camera.
(Figure: gallery of brain “failures”, e.g. the inside-out retina, V1 at the back of the head…)
neuronlevel
The fundamental unit of the brain is the neuron, which takes inputs from other neurons and then provides an output.
Individual neurons perform varying computations. Some neurons have been shown to linearly sum their inputs
 neurons are complicated (perceptron > … > detailed compartmental model)
For more information, see a very good review on modeling individual neurons
 converting to spikes introduces noise
 perhaps just price of longdistance communication
networklevel
Artificial neural networks can compute in several different ways. There is some evidence that responses in higher visual areas can, to some extent, be predicted linearly from higher layers of deep networks
For the simplest intuition, here is an example of a canonical circuit for computing the maximum of a number of elements: the winner-take-all circuit (a minimal rate-model sketch follows).
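A minimal rate-model sketch of such a winner-take-all circuit: each unit excites itself and all units share a global inhibitory signal, so only the unit with the largest input stays active. The specific parameters are illustrative, not tuned to any biological circuit:

```python
import numpy as np

def winner_take_all(inputs, steps=200, dt=0.1, self_exc=1.2, inhib=1.0):
    """Toy rate-model WTA: self-excitation plus shared global inhibition
    amplifies differences until only the most strongly driven unit remains."""
    r = np.zeros_like(inputs, dtype=float)
    for _ in range(steps):
        drive = inputs + self_exc * r - inhib * r.sum()
        r += dt * (-r + np.maximum(drive, 0))   # leaky rate dynamics with rectification
    return r

print(winner_take_all(np.array([1.0, 1.1, 0.9])))  # only the unit with input 1.1 survives
```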
Other network structures, such as that of the hippocampus, are surely useful as well.
Questions at this level bear on population coding, or how groups of neurons jointly represent information.
engram  unit of cognitive information imprinted in a physical substance, theorized to be the means by which memories are stored
highlevel concepts
Key concepts differentiate the learning process. Online,
 learning
 highlevel
 attention
 memory
 robustness
 recurrence
 topology
 glial cells
 inspirations
 canonical cortical microcircuits
 nested loop architectures
 avoiding catastrophic forgetting through synaptic complexity
 learning asymmetric recurrent generative models

spiking networks (bindsnet)
 Computational Theory of Mind
 Classical associationism
 Connectionism
 Situated cognition
 Memoryprediction framework
 Fractal Theory: https://www.youtube.com/watch?v=axaH4HFzA24
 Brain sheets are made of cortical columns (about .3mm diameter, 1000 neurons / column)
 Have ~6 layers
 Brain as a Computer – Analog VLSI and Neural Systems by Mead (VLSI – very large scale integration)
 Process info
 Signals represented by potential
 Signals are amplified = gain
 Power supply
 Knowledge is not stored in knowledge of the parts, but in their connections
 Based on electrically charged entities interacting with energy barriers
 http://en.wikipedia.org/wiki/Computational_theory_of_mind
 http://scienceblogs.com/developingintelligence/2007/03/27/whythebrainisnotlikeaco/
 Brain’s storage capacity is about 2.5 petabytes (Scientific American, 2005)
 Electronics
 Voltage can be thought of as water in a reservoir at a height
 It can flow down, but the water will never reach above the initial voltage
 A capacitor is like a tank that collects the water under the reservoir
 The capacitance is the cross-sectional area of the tank
 Capacitance – electrical charge required to raise the potential by 1 volt
 Conductance = 1/ resistance = mho, siemens
 We could also say the world is a computer with individuals being the processors – with all the wasted thoughts we have – the solution is probably to identify global problems and channel people’s focus towards working on them
 Brain chip: http://www.research.ibm.com/articles/brainchip.shtml
 Brains are not digital
 Brains don’t have a CPU
 Memories are not separable from processing
 Asynchronous and continuous
 Details of brain substrate matter
 Feedback and Circular Causality
 Asking questions
 Brains have lots of sensors
 Lots of cellular diversity
 NI uses lots of parallelism
 Delays are part of the computation
 http://timdettmers.com/
 problems with brain simulations
 Not possible to test specific scientific hypotheses (compare this to the large hadron collider project with its perfectly defined hypotheses)
 Does not simulate real brain processing (no firing connections, no biological interactions)
 Does not give any insight into the functionality of brain processing (the meaning of the simulated activity is not assessed)
 Neuron information processing parts
 Dendritic spikes are like first layer of conv net
 Neurons will typically have a genome that differs from the genome you were born with. Neurons may have additional or fewer chromosomes, and may have sequences of information removed from or added to certain chromosomes.
 http://timdettmers.com/2015/03/26/convolutiondeeplearning/
 The adult brain has 86 billion neurons, about 10 trillion synapses, and about 300 billion dendrites (tree-like structures with synapses on them)