6.4. cogsci#
Some notes on a computational perspective on cognitive science.
6.4.1. nativism, empiricism#
cogsci “inverse problem” - given the world, how do we form representations?
e.g. given pixels, how do we understand world?
historical cognitive development: how do we form representations?
nativism - plato - representation w/out learning
empiricism - aristotle - learning w/out representation
constructivism - jean piaget - cognitive development
types
assimilation - start with schema, which tells you how to act
accommodation - adjust schema based on what you see
equilibration - assimilation and accommodation are in balance (everything set)
periods
sensorimotor (0 - 18 months)
preoperational (18 months - 5 years)
concrete operational (5 years - adolescence)
formal operational (adolescence and beyond)
problems
dev. is domain-specific and variable
when measured better, children show earlier competence
doesn’t specify learning methods
contemporary theories
nativism - core knowledge or modules
plato, descartes
chomsky - language acquisition device
spelke, tenenbaum - core knowledge of domains
constraint nativism vs. starting-state nativism
empiricism - connectionism, dynamic systems, associationism
aristotle, david hume
behaviorism - led to rl
connectionists - mcclelland, karmiloff-smith
dynamic systems - thelen and smith
emergentist approach - complex behavior is emergent from simple neural properties
john locke
deep learning
constructivism - “the theory theory” (children are scientists), probabilistic models
carey, wellman, gelman, gopnik
structural features: abstract, coherent, causal, hierarchical
theories change in response to evidence
changes may take place at multiple levels
playing
perhaps bayesian learning
information-processing - memory changes over time etc.
siegler
socio-cultural influences - children learn from society / culture
vygotsky
unanswered questions
search problem: how do children search through all possible hypotheses? sampling?
conceptual change problem: how is radical change in representations possible
counterfactual - relating to or expressing what has not happened or is not the case
e.g. If kangaroos had no tails, they would topple over
6.4.2. probabilistic causal models#
abstract representations that provide computational accounts of causal inference
at marr’s highest level: computation
how much do we need to build in?
dna can build in things that are hard to learn
start with nothing built in, ex. deep learning (connectionism)
start with best possible learning algorithm and ask what you need to build in (bayesian modeling)
bayes' rule: \(\overbrace{p(\theta | x)}^{\text{posterior}} = \frac{\overbrace{p(x|\theta)}^{\text{likelihood}} \overbrace{p(\theta)}^{\text{prior}}}{p(x)}\) where x is data and \(\theta\) are hypotheses (a small numeric sketch appears at the end of this subsection)
ask what priors explain people’s inferences
humans have strongly causal priors - restricts hypothesis space of possible bayesian networks
hierarchical model - get prior from something (e.g. know all bags contain the same color)
what develops over time?
bayesian doesn’t really tell us this - just has probabilities evolve over time
real life we come up with new hypotheses
what representations do we use?
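To make the Bayes-rule bullet above concrete, here is a minimal numeric sketch; the hypotheses, priors, and data are made up for illustration, and the bags-of-marbles setup just echoes the hierarchical-model bullet.

```python
# hedged sketch: infer which kind of bag we are drawing marbles from,
# using bayes' rule; all numbers here are illustrative assumptions
import numpy as np

hypotheses = ["all red", "all blue", "half red / half blue"]
prior = np.array([1/3, 1/3, 1/3])          # p(theta): no hypothesis favored a priori
p_red_given_h = np.array([1.0, 0.0, 0.5])  # p(draw red | theta)

data = ["red", "red", "red"]               # observed draws x

# likelihood p(x | theta), assuming independent draws
likelihood = np.ones_like(prior)
for draw in data:
    likelihood *= np.where(draw == "red", p_red_given_h, 1 - p_red_given_h)

posterior = likelihood * prior
posterior /= posterior.sum()               # divide by the evidence p(x)

for h, p in zip(hypotheses, posterior):
    print(f"p({h} | data) = {p:.3f}")
# three red draws push most of the posterior mass onto "all red"
```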
6.4.3. deep rl#
6.4.3.1. dl basics#
what does an NN do?
universal function approximator
feature learning view - learn features that separate classes linearly
bhand…saxe, ng 2010 compared statistics of primary cortical receptive fields and receptive field plasticity
vision + audio + somatosensory - statistics seem to line up with neuroscience
a single algorithm?
ex. seeing with your tongue
ex. blind people echolocating
ex. ferret optic nerve rewired to auditory system and they learn to still see
lots of inductive bias? or lack of bias?
gabor filters come from deep learning, ica, sparse coding, k-means - come from data, not necessarily algorithm
maybe this is only true for gabor filters though
bigger and simpler with residuals seems to work best
sequence tasks - lstm started being replaced with “attention is all you need” (attention is basically a dot product - see the sketch at the end of this subsection)
differentiable neural computer didn’t seem to work all that well - tried to build in too much
some inductive bias has worked: cnn, lstm
structure
structure is very dependent on optimization
depth seems to help
compositionality
increase in representational power?
information theoretic arguments
low-data regimes: few-shot recognition doesn’t work (see Lake, Salakhutdinov, Tenenbaum 2013)
one type of meta-learning for few-shot learning
split data set into tiny train / test blocks and learn to do few-shot learning on these
got really good at this (especially for omniglot)
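A rough sketch of the episodic few-shot setup described in the bullets above: repeatedly carve tiny train/test ("support"/"query") splits out of a labeled dataset so the learner practices the few-shot problem itself. The dataset, split sizes, and function names are illustrative assumptions, not from any specific paper.

```python
# hedged sketch: build few-shot "episodes" (n-way, k-shot) from a labeled dataset
import numpy as np

rng = np.random.default_rng(0)
num_classes, examples_per_class, dim = 50, 20, 64
dataset = {c: rng.normal(size=(examples_per_class, dim)) for c in range(num_classes)}

def sample_episode(n_way=5, k_shot=1, n_query=5):
    """pick n_way classes; give k_shot labeled examples each (support)
    and hold out n_query per class as the episode's own test set (query)"""
    classes = rng.choice(num_classes, size=n_way, replace=False)
    support, query = [], []
    for label, c in enumerate(classes):
        idx = rng.permutation(examples_per_class)
        support += [(dataset[c][i], label) for i in idx[:k_shot]]
        query += [(dataset[c][i], label) for i in idx[k_shot:k_shot + n_query]]
    return support, query

# a meta-learner loops over many such episodes, adapting on `support`
# and being scored on `query`
support, query = sample_episode()
print(len(support), "support examples,", len(query), "query examples")
```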
deep learning: less structure + more data = better results
alternative view - not necessarily dl
very unconstrained function class + right learning algorithm + right supervision = better results
optimization method matters (is sgd “conveniently” Bayesian?)
ex. the “everything that works works because it’s Bayesian” blog post
structure of natural data matters
source of supervision really matters
maybe function class doesn’t really matter
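Since the notes above gloss attention as "basically a dot product", here is a minimal sketch of scaled dot-product attention; shapes and values are arbitrary.

```python
# hedged sketch of scaled dot-product attention: each query attends to all
# keys via dot products, and the resulting weights mix the values
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) dot products
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)          # (4, 8): one mixed value vector per query
```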
6.4.3.2. supervision#
unsupervised learning: learn \(p_\theta (x)\)
overview
pull out features/representation and use them for other tasks
recognize novel / unexpected events
hallucinate
can use all available data
often has principled, prob. interpretation
hard to use effectively
deep belief networks (~2006) - built on restricted Boltzmann machine (both a NN and a graphical model)
RBM = markov field with binary variables and connections between but not within layers
originally trained layerwise
properties
each neuron is (binary) random variable, so we can sample
hard to train
unsupervised gd (2012-2014) - sparse/denoising/bottleneck autoencoder
ex. if you have linear encoder/decoder with bottleneck, then you get pca
ex. denoising - blur image and ask it to learn to clean it up
learns something about the manifold that represents a class / the correct data
ex. variational autoencoders (Kingma & Welling 2013)
network encoder x->z & decoder z->x
force z to be spherical Gaussian (ex. add noise to make it spherical Gaussian (with learned variance) - KL divergence regularizer between output and spherical Gaussian)
then, we can sample z from spherical Gaussian and decoder will give us nice looking x
ex. GANs (Goodfellow et al. 2014)
“implicit” density model
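A minimal sketch of the variational autoencoder bullets above: encoder x -> z, decoder z -> x, with z pushed toward a spherical Gaussian by a KL-divergence regularizer. The layer sizes and single-hidden-layer architecture are arbitrary assumptions (PyTorch here, but any framework works).

```python
# hedged VAE sketch: reparameterized sampling of z plus a KL regularizer
# toward a spherical Gaussian; sizes are illustrative only
import torch
import torch.nn as nn

x_dim, z_dim, h_dim = 784, 20, 200
encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
to_mu, to_logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

def vae_loss(x):
    h = encoder(x)
    mu, logvar = to_mu(h), to_logvar(h)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # add learned noise
    x_hat = decoder(z)
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()              # reconstruction term
    # KL( N(mu, var) || N(0, I) ): the spherical-Gaussian regularizer
    kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
    return recon + kl

x = torch.rand(16, x_dim)   # stand-in batch of data
print(vae_loss(x).item())
# to sample nice-looking x: decode z ~ N(0, I), i.e. decoder(torch.randn(1, z_dim))
```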
supervised learning
representations are surprisingly good
labels provide good semantic knowledge
simple backprop training
self-supervised learning (often grouped under semi-supervised)
ex. given patches, tell where they are placed relative to each other (context prediction doersch et al. 2015)
prediction supervised learning (time series)
ex. predict video frames from past video frames
takes some artistry to figure out the objective
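A rough sketch of the context-prediction pretext task mentioned above: sample a center patch and one of its eight neighbors, and the training label is which neighbor it was. The toy image and patch size are made-up assumptions.

```python
# hedged sketch: generate (patch pair, relative-position label) training examples
# for self-supervised context prediction; sizes are illustrative only
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))   # stand-in image
patch = 32

# offsets of the 8 neighbors around a center patch; the index 0..7 is the label
offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_pair():
    # pick a center location that leaves room for every neighbor patch
    r = rng.integers(patch, image.shape[0] - 2 * patch)
    c = rng.integers(patch, image.shape[1] - 2 * patch)
    center = image[r:r + patch, c:c + patch]
    label = rng.integers(len(offsets))
    dr, dc = offsets[label]
    neighbor = image[r + dr * patch:r + (dr + 1) * patch,
                     c + dc * patch:c + (dc + 1) * patch]
    return center, neighbor, label   # a network is trained to predict `label`

center, neighbor, label = sample_pair()
print(center.shape, neighbor.shape, label)
```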
reinforcement learning
non-differentiable reward function - no direct access to output supervision
supervised learning is a subset of this
3 types (often we do a combination of these together)
direct policy search
prediction of value functions
prediction of future states
can collect data by just exploring world
rl needs more generalization
want more exploration algorithms
train in more settings
ex. approximate bayesian experiment design
6.4.4. objects#
very deep cnns - not just a chain, combine different layers and maybe represent semantic concepts
iterative deep aggregation - generalizes skip connections - new layers for the skips
adaptive learning - too much dataset bias - imagenet things only work on imagenet
need to know about discrepancy between distributions
GAN - learn adversarial that decides whether a point comes from one domain or another
goal: want classes in both domains to be separable using labels from only one domain
explainable vision and language agents
image captioning doesn’t mean it actually understands scene
now, people focused on visual question answering
maybe even provide interpretable explanation
preprogram 5 types of modules (ex. find, relate, count, …)
concept vs. category learning
except for NN, need negative examples
learning nouns
strong sampling - pick key examples instead of lots of random examples
bayesian learning can figure out what classes to generalize to
need to assume some ontology
it’s all about the data
face detection - made things really work
learning spectrum: extrapolation (low samples) -> interpolation (high samples - where we are)
extrapolation = linear reg. -> interpolation = nearest neighbor / neural network
nearest neighbors started to work better with more data
explainability doesn’t work if system is fundamentally complicated
natural world has too many examples for interpolation
brain doing nearest neighbors?
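A minimal sketch of the interpolation / nearest-neighbor view in the bullets above: with enough stored examples, classification can just be a lookup. The two-blob data here is synthetic.

```python
# hedged sketch: classify a query point by the label of its single closest
# stored example; denser stored data = better "interpolation"
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, size=(500, 2))   # made-up class 0 blob
X1 = rng.normal(loc=+2.0, size=(500, 2))   # made-up class 1 blob
X = np.vstack([X0, X1])
y = np.array([0] * 500 + [1] * 500)

def nearest_neighbor(query):
    dists = np.linalg.norm(X - query, axis=1)
    return y[np.argmin(dists)]   # label of the closest stored example

print(nearest_neighbor(np.array([-1.5, -2.2])))  # near the class-0 blob
print(nearest_neighbor(np.array([2.3, 1.8])))    # near the class-1 blob
```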
capacity of visual long term memory - standing 1973: 10k images, 83% recognition
clap if you see a repetition - you can really remember a lot
AB testing at end - check if you saw things or not
novel objects: 92%, different examples of same thing: 88%, different states of same thing: 87%
we can’t do this with just textured images though
big boosts come from data, not necessarily the algorithm
word2vec works about same as nearest neighbors embedding
top-down vs bottom-up concept learning
problems with top-down / using labels
can’t ask computer about semantic stuff - it only sees pixels
cool ex. can see a chair by just looking at silhouette of person sitting in it
humans don’t categorize things in binary ways
solns: hierarchy, levels of categories…
used to have to do this (ex. books in a library), but computers don’t have to do this (ex. amazon)
still problematic
intransitivity - car seat is chair, chair is furniture, …
multiple category membership
“ontologies are overrated” blog post
this helps with interpretability, but might be worse overall
start from image (bottom) - use association not categorization
make graph - edges are associations
task: predict what’s behind some blurred out box in an image
6.4.5. language development#
lots of children have language difficulties
language might be unique to humans
language is somewhat instinctual
learned via association (John Locke) or by reading people’s intentions (St. Augustine)
behaviorist approach (Skinner) - conditioning, children learn because of adult reinforcement
ex. bird turning to get reward
“meaning” - an association between stimulus and response
progressively taught more complex utterances
noam chomsky’s critique of skinner review (1959)
parents correct truth of utterance, not its grammar
stimulus doesn’t completely determine response
children say things they’ve never heard (e.g. mommy eated the apple)
grammar isn’t a chain of associated words
chomsky: children have innate knowledge
introduces “cognitive revolution” - to explain language you need
mental representations
rules that operate on those representations
figure out what is common between all languages
ex. verb agrees with subject
mostly part of speaker’s unconscious knowledge of language
focus on ideal speaker’s competence, not their performance
ex. question formation
poverty of the stimulus - children acquire knowledge that doesn’t seem to be present in the stimulus
chomsky theory of universal grammar
fixed invariant structural principles because human brain is wired to understand them
nativists - children guided by universal grammar
constructivists - innate general learning mechanisms operate on experience
children use babbling
children use pre-linguistic communication for sharing attention
some children have crib speech - talking to themselves, seems systematic
children have U-curves - use ate, then eated, then back to ate
6.4.6. theory of mind#
theory of mind - human understanding of agents’ mental states and how they shape actions
understand intentional actions - e.g. from animations of moving dots, we infer evil, purpose
we can track this month by month
Theory of mind
(1) is rapidly acquired in the normal case
(2) is acquired in an extended series of developmental accomplishments
(3) encompasses several basic insights that are acquired world-wide on a roughly similar trajectory (but not timetable)
(4) requires considerable learning and development based on an infant set of specialized abilities to attend to and represent persons
(5) is severely impaired in autism
(6) is severely delayed in deaf children of hearing parents
(7) results from but also contributes to specialized neural substrates associated with reasoning about agency, experience, and mind
Preschool Theory of Mind, Part 1: Universal Belief-Desire Understanding
6.4.7. lesswrong#
cognitive bias can be thought of like statistical bias
base rate neglect: grounding one’s judgments in how well sets of characteristics feel like they fit together, and neglecting how common each characteristic is in the population at large
e.g. if you meet a shy person, are they more likely to be a salesperson or a librarian? (a worked example appears at the end of this section)
sunk cost fallacy
scope neglect - the number of birds saved—the scope of the altruistic action—had little effect on willingness to pay (post)
availability heuristic - judging the frequency or probability of an event by the ease with which examples of the event come to mind.
absurdity bias - events that have never happened are not recalled, and hence deemed to have probability zero.
conjunction fallacy - humans assign a higher probability to a proposition of the form “A and B” than to one of the propositions “A” or “B” in isolation
The implausibility of one claim is compensated by the plausibility of the other; they “average out.”
planning fallacy - people plan as if “best guess” scenarios were the same as “best case” scenarios
rationality
epistemic rationality - systematically improving the accuracy of your beliefs.
instrumental rationality - systematically achieving your values.
probability theory, and decision theory
System 1 and System 2—fast perceptual judgments versus slow deliberative judgments. System 2’s deliberative judgments aren’t always true, and System 1’s perceptual judgments aren’t always false; so it is very important to distinguish that dichotomy from “rationality.”
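A small worked example of the base-rate-neglect bullet above (the shy salesperson vs. librarian question), with made-up but plausibly shaped numbers:

```python
# hedged worked example of base rate neglect: even if librarians are far more
# likely to be shy, salespeople can dominate because there are many more of them
p_librarian, p_salesperson = 0.01, 0.10            # assumed base rates
p_shy_given_librarian, p_shy_given_salesperson = 0.6, 0.1

joint_lib = p_shy_given_librarian * p_librarian            # 0.006
joint_sales = p_shy_given_salesperson * p_salesperson      # 0.010

# restricting attention to shy people in these two jobs:
print("p(librarian | shy):", joint_lib / (joint_lib + joint_sales))      # ~0.375
print("p(salesperson | shy):", joint_sales / (joint_lib + joint_sales))  # ~0.625
```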