1.1. comp neuro
1.1.2. high-dimensional computing
high-level overview
current inspiration has all come from single neurons at a time - hd computing is going past this
the brain's circuits are high-dimensional
elements are stochastic not deterministic
can learn from experience
no 2 brains are alike yet they exhibit the same behavior
basic question of comp neuro: what kind of computing can explain behavior produced by spike trains?
recognizing ppl by how they look, sound, or behave
learning from examples
remembering things going back to childhood
communicating with language

in these high dimensions, most points are close to equidistant from one another (L1 distance), and are approximately orthogonal (dot product is close to 0)
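A quick numerical check of the claim above, as a minimal sketch: it assumes random bipolar (+1/-1) vectors of dimension 10k, and the number of vectors (50) is arbitrary. For bipolar vectors the L1 distance is just twice the Hamming distance, so the normalized Hamming distance is reported instead.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000          # dimensionality (assumed; matches the 10k used in these notes)
n_vectors = 50      # number of random points to compare (arbitrary)

# Random bipolar (+1/-1) vectors, one per row.
X = rng.choice([-1, 1], size=(n_vectors, D))

# Cosine similarity between all distinct pairs (each vector has norm sqrt(D)).
cos = (X @ X.T) / D
off_diag = cos[~np.eye(n_vectors, dtype=bool)]

# Fraction of entries in which each pair differs (normalized Hamming distance).
hamming = (1 - off_diag) / 2

print("max |cosine| over pairs:", np.abs(off_diag).max())
print("normalized Hamming distance range:", hamming.min(), hamming.max())
```

All pairwise cosines should sit within a few hundredths of 0, and all distances should cluster tightly around 0.5.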
memory
heteroassociative - can return stored X based on its address A
autoassociative - can return stored X based on a noisy version of X (since it is a point attractor), maybe with some iteration
this adds robustness to the memory
this also removes the need for addresses altogether
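A rough sketch of autoassociative recall. This is a one-step nearest-neighbor "cleanup" over the stored vectors rather than an iterated attractor network, which is a simplifying assumption. The number of stored items (100) and the noise level (30% of entries flipped) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000
items = rng.choice([-1, 1], size=(100, D))   # 100 stored patterns

def cleanup(query, memory):
    """Return the index of the stored vector most similar to the query."""
    return int(np.argmax(memory @ query))

# Corrupt stored item 42 by flipping 30% of its entries.
noisy = items[42].copy()
flip = rng.choice(D, size=int(0.3 * D), replace=False)
noisy[flip] *= -1

recovered = cleanup(noisy, items)
print(recovered)
```

The noisy vector still has a dot product of about 4000 with its original, against roughly 0 (std about 100) for every other stored item, so no address is needed for the lookup.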
1.1.2.1. definitions
what is hd computing?
compute with random high-dim vectors
ex. 10k vectors A, B of +1/-1 (also extends to real / complex vectors)
3 operations
addition: A + B = (0, 0, 2, 0, 2, -2, 0, ...)
multiplication: A * B = elementwise product, with every entry +1 or -1 - this is XOR in the equivalent binary (0/1) encoding
want this to be invertible, distribute over addition, preserve distance, and be dissimilar to the vectors being multiplied
number of 1s after XOR (equivalently, number of -1s in the +1/-1 product) is the Hamming distance between the two original vectors
can represent a dissimilar set vector by using multiplication
permutation: shuffles values
ex. rotate (bit shift with wrapping around)
multiply by permutation matrix (where each row and col contain exactly one 1)
can think of permutation as a list of numbers 1, 2, ..., n in permuted order
many properties similar to multiplication
random permutation randomizes
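The three operations can be sketched with NumPy on bipolar vectors. The helper `sim` and the variable names are illustrative only, and rotation by one position stands in for an arbitrary permutation.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 10_000
A, B, C = rng.choice([-1, 1], size=(3, D))

def sim(u, v):
    """Normalized dot product (cosine similarity)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

add = A + B             # entries in {-2, 0, 2}; stays similar to A and to B
mul = A * B             # elementwise product; dissimilar to A and to B
perm = np.roll(A, 1)    # rotation (cyclic shift); dissimilar to A

print(sim(add, A))      # roughly 0.7
print(sim(mul, A))      # roughly 0
print(sim(perm, A))     # roughly 0

# Multiplication is its own inverse for bipolar vectors.
unbound = mul * B       # recovers A exactly
# Permutation is inverted by rotating back.
unrotated = np.roll(perm, -1)
# The count of -1 entries in the product is the Hamming distance.
n_differ = int(np.sum(mul == -1))
```

Multiplying both A and B by the same C leaves their similarity unchanged, which is the "preserves distance" property listed above.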
basic operations
weighting by a scalar
similarity = dot product (sometimes normalized)
A \(\cdot\) A = 10k
A \(\cdot\) B \(\approx\) 0 for an independently drawn B (orthogonal)
in high-dim spaces, almost all pairs of vectors are dissimilar in this sense
goal: similar meanings should have large similarity
normalization
for binary vectors, just take the sign
for non-binary vectors, scalar weight
data structures
these operations allow for encoding all normal data structures: sets, sequences, lists, databases
set - can represent with a sum (since the sum is similar to all the vectors)
can find a stored set using any element
if we don't store the sum, can probe with the sum and keep subtracting the vectors we find
multiset = bag (stores set with frequency counts) - can store things with order by adding them multiple times, but hard to actually retrieve frequencies
sequence - could have each element be an address pointing to the next element
problem - hard to represent sequences that share a subsequence (could have pointers which skip over the subsequence)
soln: index elements based on permuted sums
can look up an element based on previous element or previous string of elements
could do some kind of weighting also
pairs - could just multiply (XOR), but then get some weird things, e.g. A * A = 0
instead, permute then multiply
can use these to index (address, value) pairs and make more complex data structures
named tuples - have smth like (name: x, date: m, age: y) and store as holistic vector \(H = N*X + D*M + A*Y\)
individual attribute value can be retrieved using vector for individual key
representation substituting is a little trickier...
we blur what is a value and what is a variable
can do this for a pair or for a named tuple with new values
this doesn't always work
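A minimal sketch of the named-tuple encoding \(H = N*X + D*M + A*Y\) and key-based retrieval, assuming bipolar vectors and plain elementwise multiplication for binding. The dictionary names and the two distractor values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
DIM = 10_000

def rand_hv():
    """Draw a fresh random bipolar hypervector."""
    return rng.choice([-1, 1], size=DIM)

# Keys (N, D, A) and candidate values (X, M, Y plus two distractors).
keys = {k: rand_hv() for k in ("name", "date", "age")}
values = {v: rand_hv() for v in ("x", "m", "y", "other1", "other2")}

# H = N*X + D*M + A*Y
H = (keys["name"] * values["x"]
     + keys["date"] * values["m"]
     + keys["age"] * values["y"])

def lookup(record, key):
    """Unbind with the key, then pick the closest known value vector."""
    noisy = record * key   # the bound value plus noise from the other two pairs
    names = list(values)
    sims = [noisy @ values[n] for n in names]
    return names[int(np.argmax(sims))]

print(lookup(H, keys["name"]), lookup(H, keys["date"]), lookup(H, keys["age"]))
```

Unbinding works because a bipolar key multiplied by itself is the all-ones vector. The other two terms remain as noise that is nearly orthogonal to every stored value.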
examples
context vectors
standard practice (e.g. LSA): make matrix of word counts, where each row is a word, and each column is a document
HD computing alternative: each row is a word, but each document is assigned a few (~10) columns at random
thus, the number of columns doesn't scale with the number of documents
can also do this randomness for the rows (so the number of rows < the number of words)
can still get semantic vector for a row/column by adding together the rows/columns which are activated by that row/column
this example still only uses bag-of-words (but can be extended to more)
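A toy sketch of the random-column idea for context vectors. The corpus, the matrix width (`n_cols`), and the number of active columns per document (`k`) are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(4)
n_cols = 10_000   # fixed width, independent of the number of documents
k = 10            # random columns assigned to each document

docs = [
    "cat dog pet",
    "dog pet animal",
    "cat animal pet",
    "stock market trade",
    "market price stock",
    "trade price market",
] * 10            # toy corpus of 60 short documents

vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), n_cols))

for doc in docs:
    cols = rng.choice(n_cols, size=k, replace=False)   # this document's columns
    for w in doc.split():
        M[row[w], cols] += 1      # count the word in each of those columns

def sim(w1, w2):
    """Cosine similarity between the context vectors of two words."""
    u, v = M[row[w1]], M[row[w2]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(sim("cat", "dog"), sim("cat", "stock"))
```

Words that share documents end up sharing columns, so "cat" and "dog" come out similar while "cat" and "stock" overlap only by chance.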
learning rules by example
particular instance of a rule is a rule (e.g. mother-son-baby \(\to\) grandmother)
as we get more examples and average them, the rule gets better
doesn't always work (especially when things collapse to identity rule)
analogies from pairs
ex. what is the dollar of mexico?
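One way the "dollar of mexico" query can be worked through, as a sketch: it assumes each country is stored as a record of role*filler pairs, and the role names (NAME, CAPITAL, MONEY) and filler names are chosen here for illustration. Multiplying the two country records gives a mapping vector that pairs up fillers sharing a role.

```python
import numpy as np

rng = np.random.default_rng(5)
DIM = 10_000
names = ["NAME", "CAPITAL", "MONEY",
         "USA_NAME", "DC", "DOLLAR",
         "MEX_NAME", "MEXICO_CITY", "PESO"]
hv = {n: rng.choice([-1, 1], size=DIM) for n in names}

# Each country is a sum of role * filler bindings.
usa = (hv["NAME"] * hv["USA_NAME"]
       + hv["CAPITAL"] * hv["DC"]
       + hv["MONEY"] * hv["DOLLAR"])
mex = (hv["NAME"] * hv["MEX_NAME"]
       + hv["CAPITAL"] * hv["MEXICO_CITY"]
       + hv["MONEY"] * hv["PESO"])

# The product contains DOLLAR * PESO (the shared MONEY role cancels) plus noise.
mapping = usa * mex

# Ask: what plays the role in mexico that DOLLAR plays in the usa?
query = hv["DOLLAR"] * mapping
answer = max(names, key=lambda n: query @ hv[n])
print(answer)
```

Only the matching-role terms cancel cleanly. The remaining cross terms behave like random vectors, which is why the nearest-neighbor step at the end is needed.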
1.1.2.2. ex. identify the language
paper: LANGUAGE RECOGNITION USING RANDOM INDEXING (joshi et al. 2015)
benefits - very simple and scalable - only go through data once
equally easy to use 4-grams vs. 5-grams
data
train: given a million bytes of text per language (in the same alphabet)
test: new sentences for each language
training: compute a 10k profile vector for each language and for each test sentence
could encode each letter with a seed vector which is 10k
instead encode trigrams with rotate and multiply
1st letter vec rotated by 2 * 2nd letter vec rotated by 1 * 3rd letter vec
ex. THE = r(r(T)) * r(H) * E
approximately orthogonal to all the letter vectors and all the other possible trigram vectors...
profile = sum of all trigram vectors (taken sliding)
ex. banana = ban + ana + nan + ana
profile is like a histogram of trigrams
testing
compare each test sentence to profiles via dot product
clusters similar languages - cool!
gets 97% test acc
can query the letter most likely to follow "TH"
form query vector \(Q = r(r(T)) * r(H)\)
query by multiplying: X = Q * english-profile-vec
find closest letter vecs to X - yields "e"
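A small end-to-end sketch of the trigram profile scheme, using two tiny made-up training texts instead of a million bytes per language. The alphabet (lowercase letters plus space), the texts, and all function names are assumptions for illustration, not the paper's setup.

```python
import string
import numpy as np

rng = np.random.default_rng(6)
DIM = 10_000
alphabet = string.ascii_lowercase + " "
letter = {ch: rng.choice([-1, 1], size=DIM) for ch in alphabet}

def r(v, n=1):
    """Rotate (cyclic shift) by n positions."""
    return np.roll(v, n)

def trigram(a, b, c):
    """Encode the trigram abc as r(r(a)) * r(b) * c."""
    return r(letter[a], 2) * r(letter[b], 1) * letter[c]

def profile(text):
    """Sum of trigram vectors over a sliding window."""
    total = np.zeros(DIM)
    for i in range(len(text) - 2):
        total += trigram(text[i], text[i + 1], text[i + 2])
    return total

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

train = {
    "english": "the quick brown fox jumps over the lazy dog and then "
               "the dog runs after the fox through the green field",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y luego "
               "el perro corre tras el zorro por el campo verde",
}
profiles = {lang: profile(text) for lang, text in train.items()}

def identify(sentence):
    """Return the language whose profile is closest to the sentence profile."""
    p = profile(sentence)
    return max(profiles, key=lambda lang: cosine(p, profiles[lang]))

def next_letter(a, b, lang="english"):
    """Most likely letter after the pair ab, read out of the profile."""
    q = r(letter[a], 2) * r(letter[b], 1)
    x = q * profiles[lang]    # approximates the sum of letters seen after ab
    return max(alphabet, key=lambda ch: x @ letter[ch])

print(identify("the dog and the fox"), identify("el perro y el zorro"))
print(next_letter("t", "h"))
```

Changing `trigram` to cover four or five letters only adds more rotate-and-multiply terms, which is why n-gram size is cheap to vary.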
1.1.2.3. details
mathematical background
randomly chosen vecs are dissimilar
sum vector is similar to its argument vectors
product vector and permuted vector are dissimilar to their argument vectors
multiplication distributes over addition
permutation distributes over both addition and multiplication
multiplication and permutations are invertible
addition is approximately invertible
comparison to DNNs
both do statistical learning from data
data can be noisy
both use high-dim vecs although DNNs get bad with high dims (e.g. 100k)
HD is founded on rich mathematical theory
new codewords are made from existing ones
HD memory is a separate func
HD algos are transparent, incremental (online), scalable
somewhat closer to the brain...cerebellum anatomy seems to match HD
HD: holistic (distributed repr.) is robust
different names
Tony plate: holographic reduced representation
ross gayler: multiply-add-permute arch
gayler & levy: vector-symbolic arch
gallant & okaywe: matrix binding of additive terms
fourier holographic reduced representations (FHRR; Plate)
...many more names
theory of sequence indexing and working memory in RNNs
trying to make key-value pairs
VSA as a structured approach for understanding neural networks
reservoir computing = state-dependent network = echo-state network = liquid state machine - try to represent sequential temporal data - builds representations on the fly
1.1.2.4. papers
text classification (najafabadi et al. 2016)

note: for sparse vectors, might need some threshold before computing mean (otherwise will have too many zeros)
1.1.3. dnns with memory
Neural Statistician (Edwards & Storkey, 2016) summarises a dataset by averaging over its embeddings

like a VAE where the prior is derived from an adaptive memory store
1.1.4. visual sampling
Emergence of foveal image sampling from learning to attend in visual scenes (cheung, weiss, & olshausen, 2017) - using neural attention model, learn a retinal sampling lattice
can figure out what parts of the input the model focuses on
1.1.5. dynamic routing between capsules
hinton 1981 - reference frames require structured representations
mapping units vote for different orientations, sizes, positions based on basic units
mapping units gate the activity from other types of units - weight is dependent on if mapping is activated
top-down activations give info back to mapping units
this is a hopfield net with three-way connections (between input units, output units, mapping units)
reference frame is a key part of how we see - need to vote for transformations
olshausen, anderson, & van essen 1993 - dynamic routing circuits
ran simulations of such things (hinton said it was hard to get simulations to work)
we learn things in object-based reference frames
inputs -> outputs has weight matrix gated by control
zeiler & fergus 2013 - visualizing things at intermediate layers - deconv (by dynamic routing)
save indexes of max pooling (these would be the control neurons)
when you do deconv, assign max value to these indexes
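The save-the-indexes step can be sketched in NumPy for a single 2D feature map with non-overlapping pooling windows. The function names are made up, and this is not the paper's implementation.

```python
import numpy as np

def max_pool_with_switches(x, size=2):
    """Non-overlapping max pooling over a 2D array.

    Returns the pooled map and, per window, the flat index of the max
    (the saved "switch").
    """
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size).transpose(0, 2, 1, 3)
    flat = blocks.reshape(h // size, w // size, size * size)
    return flat.max(axis=-1), flat.argmax(axis=-1)

def unpool(pooled, switches, size=2):
    """Put each pooled value back at its saved index; zeros elsewhere."""
    ph, pw = pooled.shape
    flat = np.zeros((ph, pw, size * size), dtype=pooled.dtype)
    np.put_along_axis(flat, switches[..., None], pooled[..., None], axis=-1)
    blocks = flat.reshape(ph, pw, size, size).transpose(0, 2, 1, 3)
    return blocks.reshape(ph * size, pw * size)

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 9, 8],
              [7, 0, 6, 0]])
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches)
print(pooled)
print(restored)
```

The switches play the role of the control neurons: they route each value back to the location it came from.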
arathorn 02 - map-seeking circuits
tenenbaum & freeman 2000 - bilinear models
trying to separate content + style
hinton et al 2011 - transforming autoencoders - trained neural net to learn to shift image
sabour et al 2017 - dynamic routing between capsules
units output a vector (represents info about reference frame)
matrix transforms reference frames between units
recurrent control units settle on some transformation to identify reference frame
notes from this blog post
problems with cnns
pooling loses info
don't account for spatial relations between image parts
can't transfer info to new viewpoints
capsule - vector specifying the features of an object (e.g. position, size, orientation, hue, texture) and its likelihood
ex. an "eye" capsule could specify the probability it exists, its position, and its size
magnitude (i.e. length) of vector represents probability it exists (e.g. there is an eye)
direction of vector represents the instantiation parameters (e.g. position, size)
hierarchy
capsules in later layers are functions of the capsules in lower layers, and since capsule has extra properties can ask questions like "are both eyes similarly sized?"
equivariance = we can ensure our net is invariant to viewpoints by checking for all similar rotations/transformations in the same amount/direction
active capsules at one level make predictions for the instantiation parameters of higher-level capsules
when multiple predictions agree, a higher-level capsule is activated
steps in a capsule (e.g. one that recognizes faces)
receives an input vector (e.g. representing eye)
apply affine transformation - encodes spatial relationships (e.g. between eye and where the face should be)
apply weighted sum by the C weights, learned by the routing algorithm
these weights are learned to group similar outputs to make higher-level capsules
vectors are squashed so their magnitudes are between 0 and 1
outputs a vector
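A compact NumPy sketch of the last steps (weighted sum, squash, agreement update), following the routing-by-agreement procedure described in sabour et al 2017. The affine transformations are assumed to have been applied already, so the input `u_hat` holds each lower capsule's prediction for each higher capsule. Shapes and names are illustrative.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink vector length into [0, 1) while keeping its direction."""
    sq = np.sum(s * s, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def route(u_hat, n_iter=3):
    """Routing by agreement.

    u_hat: predictions with shape (n_in, n_out, dim).
    Returns the output capsule vectors v (n_out, dim) and the coupling
    weights c (n_in, n_out) used in the final iteration.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                      # routing logits
    for _ in range(n_iter):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)         # softmax over outputs
        s = np.einsum("ij,ijd->jd", c, u_hat)        # weighted sum
        v = squash(s)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)    # reward agreement
    return v, c
```

When the lower capsules agree on a prediction for one higher capsule, its logits grow each iteration, so its coupling weights and output length increase. Disagreeing predictions cancel in the sum and leave a short output vector.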
1.1.6. hierarchical temporal memory (htm)
uses binary synapses and learns by modeling the growth of new synapses and the decay of unused synapses
separate aspects of brains and neurons that are essential for intelligence from those that depend on brain implementation
1.1.6.1. neocortical structure
evolution leads to physical/logical hierarchy of brain regions
neocortex is like a flat sheet
neocortex regions are similar and do similar computation
Mountcastle 1978: vision regions are vision because they receive visual input
number of regions / connectivity seems to be genetic
before neocortex, brain regions were homogeneous: spinal cord, brain stem, basal ganglia, ...
1.1.6.2. principles
common algorithms across neocortex
hierarchy
sparse distributed representations (SDR) - vectors with thousands of bits, mostly 0s
bits of representation encode semantic properties
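A small sketch of why SDRs tolerate noise, assuming 2048-bit vectors with 40 active bits (about 2%). These sizes and the overlap measure are illustrative choices, not values taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(7)
n_bits, n_active = 2048, 40    # assumed sizes

def random_sdr():
    """Boolean vector with exactly n_active bits set."""
    sdr = np.zeros(n_bits, dtype=bool)
    sdr[rng.choice(n_bits, size=n_active, replace=False)] = True
    return sdr

def overlap(a, b):
    """Number of active bits the two SDRs share."""
    return int(np.sum(a & b))

a, b = random_sdr(), random_sdr()

# Noisy copy of a: switch off 10 active bits and switch on 10 other bits.
noisy = a.copy()
noisy[rng.choice(np.flatnonzero(a), size=10, replace=False)] = False
noisy[rng.choice(np.flatnonzero(~a), size=10, replace=False)] = True

print(overlap(a, noisy))   # 30
print(overlap(a, b))       # small; expected overlap is under 1 bit
```

Two unrelated SDRs almost never share more than a few bits, so an overlap of 30 out of 40 still identifies the noisy copy unambiguously.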
inputs
data from the senses
copy of the motor commands
"sensory-motor" integration - perception is stable while the eyes move
patterns are constantly changing
neocortex tries to control old brain regions which control muscles
learning: region accepts stream of sensory data + motor commands
learns of changes in inputs
outputs motor commands
only knows how its output changes its input
must learn how to control behavior via associative linking
sensory encoders - takes input and turns it into an SDR
engineered systems can use non-human senses
behavior needs to be incorporated fully
temporal memory - is a memory of sequences
everything the neocortex does is based on memory and recall of sequences of patterns
online learning
prediction is compared to what actually happens and forms the basis of learning
minimize the error of predictions
1.1.6.3. papers
"A Theory of How Columns in the Neocortex Enable Learning the Structure of the World"
network model that learns the structure of objects through movement
object recognition
over time individual columns integrate changing inputs to recognize complete objects
through existing lateral connections
within each column, neocortex is calculating a location representation
locations relative to each other = allocentric
much more motion involved
multiple columns - integrate spatial inputs - make things fast
single column - integrate touches over time - represent objects properly
"Why Neurons Have Thousands of Synapses, A Theory of Sequence Memory in Neocortex"
learning and recalling sequences of patterns
neuron with lots of synapses can learn transitions of patterns
network of these can form robust memory
1.1.7. forgetting
Continual Lifelong Learning with Neural Networks: A Review
main issue is catastrophic forgetting / stability-plasticity dilemma
2 types of plasticity
Hebbian plasticity (Hebb 1949) for positive feedback instability
compensatory homeostatic plasticity which stabilizes neural activity
approaches: regularization, dynamic architectures (e.g. add more nodes after each task), memory replay
1.1.8. deeptune-style
ponce_19_evolving_stimuli: https://www.cell.com/action/showPdf?pii=S00928674%2819%29303915
bashivan_18_ann_synthesis

use kernel regression from CNN embedding to calculate distances between preset images
select preset images
verified with macaque v4 recording
currently only study that optimizes firing rates of multiple neurons
pick next stimulus in closed-loop ("adaptive sampling" = "optimal experimental design")
J. Benda, T. Gollisch, C. K. Machens, and A. V. Herz, "From response to stimulus: adaptive sampling in sensory physiology"
find the smallest number of stimuli needed to fit parameters of a model that predicts the recorded neuron's activity from the stimulus
maximizing firing rates via genetic algorithms
maximizing firing rate via gradient ascent
C. DiMattina and K. Zhang, "Adaptive stimulus optimization for sensory systems neuroscience" (https://www.frontiersin.org/articles/10.3389/fncir.2013.00101/full)
2 general approaches: gradient-based approaches + genetic algorithms
can put constraints on stimulus space
stimulus adaptation
might want iso-response surfaces
maximally informative stimulus ensembles (Machens, 2002)
model-fitting: pick to maximize info-gain w/ model params
using fixed stimulus sets like white noise may be deeply problematic for efforts to identify nonlinear hierarchical network models due to continuous parameter confounding (DiMattina and Zhang, 2010)
use for model selection
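A toy sketch of the gradient-based approach with a constraint on the stimulus space. The "neuron" is a made-up linear-nonlinear model with a known gradient, whereas a real closed-loop experiment would have to estimate that gradient from recorded responses. The unit-energy constraint, step size, and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
n_pixels = 64
w = rng.normal(size=n_pixels)     # hidden receptive field of the toy neuron

def firing_rate(x):
    """Toy linear-nonlinear neuron: softplus of a linear filter response."""
    return float(np.log1p(np.exp(w @ x)))

def grad_firing_rate(x):
    """Analytic gradient of the toy model: sigmoid(w . x) * w."""
    return w / (1.0 + np.exp(-(w @ x)))

# Start from a random unit-energy stimulus.
x0 = rng.normal(size=n_pixels)
x0 /= np.linalg.norm(x0)

x = x0.copy()
for _ in range(200):
    x = x + 0.1 * grad_firing_rate(x)   # ascend the firing rate
    x /= np.linalg.norm(x)              # project back onto the constraint

print(firing_rate(x0), firing_rate(x))
```

Under the energy constraint the optimal stimulus for this toy model is the receptive field itself, so the result can be checked against `w`.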
1.1.9. population coding
saxena_19_pop_cunningham: "Towards the neural population doctrine"
correlated trial-to-trial variability
Ni et al. showed that the correlated variability in V4 neurons during attention and learning (processes that have inherently different timescales) robustly decreases
"choice" decoder built on neural activity in the first PC performs as well as one built on the full dataset, suggesting that the relationship of neural variability to behavior lies in a relatively small subspace of the state space.
decoding
more neurons only help if the new neuron doesn't lie in the span of previous neurons
encoding
can train dnn goal-driven or train dnn on the neural responses directly
testing
important to be able to test population structure directly
population vector coding - ex. neurons coded for direction sum to get final direction
reduces uncertainty
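A sketch of population vector decoding for direction, assuming cosine-tuned neurons with evenly spaced preferred directions and Poisson spike counts. The tuning parameters and population size are made up.

```python
import numpy as np

rng = np.random.default_rng(9)
n_neurons = 200
# Evenly spaced preferred directions, so the baseline rate cancels in the sum.
preferred = np.linspace(0, 2 * np.pi, n_neurons, endpoint=False)

def rates(theta, baseline=10.0, gain=8.0):
    """Noisy spike counts from cosine-tuned neurons."""
    mean = baseline + gain * np.cos(theta - preferred)
    return rng.poisson(mean)

def population_vector(r):
    """Sum each neuron's preferred direction weighted by its response."""
    x = np.sum(r * np.cos(preferred))
    y = np.sum(r * np.sin(preferred))
    return float(np.arctan2(y, x) % (2 * np.pi))

theta_true = 1.0
r = rates(theta_true)
decoded = population_vector(r)
print(theta_true, decoded)
```

Each neuron is noisy on its own, but the independent noise averages out across the population, which is the sense in which summing reduces uncertainty.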
correlation coding - correlations between spikes carry extra info
independent-spike coding - each spike is independent of other spikes within the spike train
position coding - want to represent a position
for grid cells, very efficient
sparse coding
hard when noise between neurons is correlated
measures of information
eda
plot neuron responses
calc neuron covariances
1.1.10. interesting misc papers
berardino 17 eigendistortions
Fisher info matrix under certain assumptions = \(J^T J\) (pixels x pixels) where J is the Jacobian matrix of the function f acting on the pixels x
most and least noticeable distortion directions correspond to the eigenvectors of the Fisher info matrix with the largest and smallest eigenvalues
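A toy sketch of the eigendistortion computation. It assumes the model response is f(x) plus additive white Gaussian noise, in which case the Fisher info matrix reduces to \(J^T J\). The one-layer `tanh` model stands in for a real network and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(10)
n_pixels, n_units = 16, 8
W = rng.normal(size=(n_units, n_pixels))

def f(x):
    """Toy model response to an image x (flattened pixels)."""
    return np.tanh(W @ x)

def jacobian(x):
    """Jacobian of f at x, with shape (n_units, n_pixels)."""
    return (1.0 - np.tanh(W @ x) ** 2)[:, None] * W

x = 0.1 * rng.normal(size=n_pixels)
J = jacobian(x)
F = J.T @ J                              # Fisher info, pixels x pixels

eigvals, eigvecs = np.linalg.eigh(F)     # eigenvalues in ascending order
least, most = eigvecs[:, 0], eigvecs[:, -1]

eps = 1e-3
d_most = np.linalg.norm(f(x + eps * most) - f(x))
d_least = np.linalg.norm(f(x + eps * least) - f(x))
print(d_most, d_least)
```

A perturbation of the same pixel-space size changes the response far more along the top eigenvector than along the bottom one, which is what makes it the more noticeable distortion for the model.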
gao_19_v1_repr
don't learn from images - v1 repr should come from motion like it does in the real world
repr
vector of local content
matrix of local displacement
why is this repr nice?
separate reps of static image content and change due to motion
disentangled rotations
learning
predict next image given current image + displacement field
predict next image vector given current frame vectors + displacement
kietzmann_18_dnn_in_neuro_rvw
friston_10_free_energy