graphical models
view markdownMaterial from Russell and Norvig “Artifical Intelligence” 3rd Edition and Jordan “Graphical Models”
overview
 network types
 bayesian networks  directed
 undirected models
 latent variable types
 mixture models  discrete latent variable
 factor analysis models  continuous latent variable
 graph representation: missing edges specify independence (converse is not true)
 encode conditional independence relationships
 helpful for inference
 compact representation of joint prob. distr. over the variables
 encode conditional independence relationships
 dark is observed for HMMs, for other things unclear what it means
bayesian networks  R & N 14.15 + J 2
 examples
 causal model: causes $\to$ symptoms
 diagnostic model: symptoms $\to$ causes
 generally requires more dependencies
 learning
 expertdesigned
 datadriven
 properties
 each node is random variable
 weights as tables of conditional probabilities for all possibilities
 represented by directed acyclic graph
 joint distr: $P(X_1 = x_1,…X_n=x_n)=\prod_{i=1}^n P[X_i = x_i \vert Parents(X_i)]$
 markov condition: given parents, node is conditionally independent of its nondescendants
 marginally, they can still be dependent (e.g. explaining away)
 given its markov blanket (parents, children, and children’s parents), a node is independent of all other nodes
 markov condition: given parents, node is conditionally independent of its nondescendants

BN has no redundancy $\implies$ no chance for inconsistency
 forming a BN: keep adding nodes, and only previous nodes are allowed to be parents of new nodes
hybrid BN (both continuous & discrete vars)
 for continuous variables can sometimes discretize
 linear Gaussian  for continuous children
 parents all discrete $\implies$ conditional Gaussian  multivariate Gaussian given assignment to discrete variables
 parents all continuous $\implies$ multivariate Gaussian over all the variables, and a multivariate posterior distribution (given any evidence)
 parents some discrete, some continuous
 h is continuous, s is discrete; a, b, $\sigma$ all change when s changes

$P(c h,s) = N(a \cdot h + b, \sigma^2)$, so mean is linear function of h  discrete children (continuous parents) 1. probit distr  $P(buysCost=c) = \phi[(\mu  c)/\sigma]$  integral of standard normal distr
 like a soft threshold 2. logit distr.  $P(buysCost=c) =s\left(\frac{2 (\mu  c)}\sigma \right)$
 logistic function (sigmoid s) produces thresh
exact inference
 given assignment to evidence variables E, find probs of query variables X
 other variables are hidden variables H 1. enumeration  just try summing over all hidden variables

$P(X e) = \alpha P(X, e) = \alpha \sum_h P(X, e, h)$  $\alpha$ can be calculated as $1 / \sum_x P(x, e)$
 $O(n \cdot 2^n)$
 one summation for each of n variables
 ENUMERATIONASK evaluates in depthfirst order: $O(2^n)$
 we removed the factor of n
 variable elimination  dynamic programming (see elimination)
 we removed the factor of n

$P(B j, m) = \alpha \underbrace{P(B)}{f_1(B)} \sum_e \underbrace{P(e)}{f_2(E)} \sum_a \underbrace{P(a B,e)}_{f_3(A, B, E)} \underbrace{P(j a)}_{f_4(A)} \underbrace{P(m a)}_{f_5(A)}$  calculate factors in reverse order (bottomup)

each factor is a vector with num entries = $\prod$ num_elements * num_values  when we multiply them, pointwise products
 ordering
 any ordering works, some are more efficient
 every variable that is not an ancestor of a query variable or evidence variable is irrelevant to the query
 complexity depends on largest factor formed
 clustering algorithms = join tree algorithms (see propagation factor graphs)
 join individual nodes in such a way that resulting network is a polytree
 polytree=singlyconnected network  only 1 undirected paths between any 2 nodes
 time and space complexity of exact inference is linear in the size of the network
 holds even if the number of parents of each node is bounded by a constant
 polytree=singlyconnected network  only 1 undirected paths between any 2 nodes
 can compute posterior probabilities in $O(n)$
 however, conditional probability tables may still be exponentially large
approximate inferences in BNs
 randomized sampling algorithms = monte carlo algorithms
 direct sampling methods:
 simplest  sample network in topological order
 rejection sampling  sample in order and stop once evidence is violated

want P(D A)  sample N times, throw out samples where A is false
 return probability of D being true
 this is slow

 likelihood weighting  fix evidence to be more efficient
 generating a sample
 fix our evidence variables to their observed values, then simulate the network
 can’t just fix variables  distr. might be inconsistent
 calculate W = prob of sample being generated
 when we get to an evidence variable, multiply by prob it appears given its parents
 for each observation
 if positive, Count = Count + W
 Total = Total + W
 return Count/Total
 this way we don’t have to throw out wrong samples
 doesn’t solve all problems  evidence only influences the choice of downstream variables
 generating a sample
 Markov chain monte carlo  ex. Gibbs sampling, MetropolisHastings
 fix evidence variables
 sample a nonevidence variable $X_i$ conditioned on the current values of its Markov blanket
 repeatedly resample oneatatime in arbitrary order
 why it works
 the sampling process settles into a dynamic equilibrium where time spent in each state is proportional to its posterior probability
 provided transition matrix q is ergodic  every state is reachable and there are no periodic cycles  only 1 steadystate soln  Gibbs: we have conditionals
 MetropolisHastings we have to estimate conditionals
 variational inference  formulate inference as optimization
 good intro paper
 minimize KLdivergence between observed samples and assumed distribution
 the actual KL is hard to minimize so instead we maximize the ELBO, which is equivalent
 do this over a class of possible distrs.
 variational inference tends to be faster, but may not be as good as MCMC
conditional independence properties

multiple, competing explanations (“explainingaway”)
 in fact any descendant of the base of the v suffices for explaining away

dseparation = directed separation

Bayes ball algorithm  is $( X_A \perp X_B ) X_C$?  initialize
 shade $X_C$
 place ball at each of $X_A$
 if any ball reaches $X_B$, then not conditionally independent
 rules
 balls can’t pass through shaded unless shaded is at base of v
 balls pass through unshaded unless unshaded is at base of v
 initialize
undirected

$X_A \perp X_C X_B$ if the set of nodes $X_B$ separates the nodes $X_A$ from $X_C$  can’t convert directed / undirected
 factor over maximal cliques (largest sets of fully connected nodes)
 potential function $\psi_{X_C} (x_C)$ function on possible realizations $x_C$ of the maximal clique $X_C$
 nonnegative, but not a probability (specifying conditional probs. doesn’t work)
 commonly let these be exponential: $\psi_{X_C} (x_C) = \exp(f_C(x_C))$
 yields energy $f(x) = \sum_C f_C(x_C)$
 yields Boltzmann distribution: $p(x) = \frac{1}{Z} \exp (f(x))$
 $p(x) = \frac{1}{Z} \prod_{C \in Cliques} \psi_{X_C}(x_c)$
 $Z = \sum_x \prod_{C \in Cliques} \psi_{X_C} (x_C)$
 reduced parameterizations  impose constraints on probability distributions (e.g. Gaussian)
 if x is dependent on all its neighbors
 Ising model  if x is binary
 Potts model  x is multiclass
elimination  J 3
 the elimination algorithm is for probabilistic inference

want $p(x_F x_E)$ where E and F are disjoint  any var that is not ancestor of evidence or ancestor of query is irrelevant

 here, let $X_F$ be a single node
 notation
 define $m_i (x_{S_i})$ = $\sum_{x_i}$ where $x_{S_i}$ are the variables, other than $x_i$, that appear in the summand
 define evidence potential $\delta(x_i, \bar{x_i})$ is defined as $x_i == \bar{x_i}$
 then \(g(\bar{x_i}) = \sum_{x_i} \delta (x_i, \bar{x_i})\)
 for a set $\delta (x_E, \bar{x_E}) = \prod_{i \in E} \delta (x_i, \bar{x_i})$
 lets us define $p(x, \bar{x}_E) = p^E(x) = p(x) \delta (x_E, \bar{x_E})$
 undirected graphs
 $\psi_i^E(x_i) \triangleq \psi_i(x_i) \delta(x_i, \bar{x}_i)$
 this lets us write $p^E (x) = \frac{1}{Z} \prod_{c\in C} \psi^E_{X_c} (x_c)$
 can ignore z since this is unnormalized anyway
 to find conditional probability, divide by all sum of $p^E(x)$ for all values of E
 in actuality don’t compute the product, just take the correct slice
 eliminate algorithm
 initialize: choose an ordering with query last
 evidence: set evidence vars to their values
 update: loop over element $x_i$ in ordering
 let $\phi_i(x_{T_i})$ be product of all potentials involving $x_i$
 sum over the product of these potentials $m_i(x_{S_i}) = \sum_x \phi_i(x_{T_i})$

normalize: $p(x_F \bar{x}E) = \phi_F(x_F) / \sum{x_F} \phi_F (x_F)$
 undirected graph elimination algorithm
 for directed graph, first moralize
 for each node connect its parents
 drop edges orientation
 for each node X
 connect all remaining neighbors of X
 remove X from graph
 for directed graph, first moralize
 reconstituted graph  same nodes, includes all edges that were added
 elimination cliques  includes X and its neighbors when X is removed
 computational complexity is the exponential in the number of variables in the elimination clique
 involves treewidth  one less than smallest achievable value of cardinality of largest elimination clique
 range over all possible elimination orderings
 NPhard to find elimination ordering that achieves the treewidth
propagation factor graphs  J 4
 tree  undirected graph in which there is exactly one path between any pair of nodes
 if directed, then moralized graph should be a tree
 polytree  directed graph that reduces to an undirected tree if we convert each directed edge to an undirected edge

\[p(x) = \frac{1}{Z} \left[ \prod_{i \in V} \psi (x_i) \prod_{(i,j)\in E} \psi (x_i,x_j) \right]\]
 for directed, root has individual prob and others are conditionals
 can once again use evidence potentials for conditioning
probabilistic inference on trees
 eliminate algorithm through messagepassing
 ordering I should be depthfirst traversal of tree with f as root and all edges pointing away
 message $m_{ji}(x_i)$ from $j$ to $i$ =intermediate factor
 ordering I should be depthfirst traversal of tree with f as root and all edges pointing away

2 key equations
 $m_{ji}(x_i) = \sum_{x_j} \left( \psi^E (x_j) \psi (x_i, x_j) \prod_{k \in N(j) \backslash i} m_{kj} (x_j) \right)$

$p(x_f \bar{x}E) \propto \psi^E (x_f) \prod{e \in N(f)} m_{ef} (x_f) $
 sumproduct = belief propagation  inference algorithm

computes all singlenode marginals (for certain classes of graphs) rather than only a single marginal

only works in trees or treelike graphs

messagepassing protocol  a node can send a message to a neighboring node when, and only when, it has received messages from all of its other neighbors (parallel algorithm)
 evidence(E)
 choose root
 collect: send messages evidence to root
 distribute: send messages root back out

factor graphs
 factor graphs capture factorizations, not conditional independence statements
 ex $\psi (x_1, x_2, x_3) = f_a(x_1,x_2) f_b(x_2,x_3) f_c (x_1,x_3)$ factors but has no conditional independence
 \[f(x_1,...,x_n) = \prod_s f_s (x_{C_s})\]
 neighborhood N(s) for a factor index s is all the variables the factor references
 neighborhood N(i) for a node i is set of factors that reference $x_i$
 provide more finegrained representation of prob. distr.
 could add more nodes to normal graphical model to do this
 ex $\psi (x_1, x_2, x_3) = f_a(x_1,x_2) f_b(x_2,x_3) f_c (x_1,x_3)$ factors but has no conditional independence
 factor tree  if factors are made nodes, resulting undirected graph is tree
 two kinds of messages (variable> factor & factor> variable)
 run all the factor $\to$ variables first
 \[p(x_i) \propto \prod_{s \in N(i)} \mu_{si} (x_i)\]
 if a graph is originally a tree, there is little to be gained from factor graph framework
 sometimes factor graph is factor tree, but original graph is not
maximum a posteriori (MAP)

want $\max_{x_F} p(x_F \bar{x}_E)$  MAPeliminate algorithm is very similar to before
 initialize  choose ordering
 evidence  set evidence
 update  for each take max over variable and make new factor
 maximum  marginalize
 products of probs tend to underflow, so take $\max_x \log p^E (x)$
 can also derive a maxproduct algorithm for trees
 find $\text{argmax}_x p^E (x)$
 can solve by keeping track of maximizing values of variables in maxproduct algorithm
dynamic bayesian nets
 dynamic bayesian nets  represents a temporal prob. model
state space model

state space model

$P(X_{0:t}, E_{1:t}) = P(X_0) \prod_{i} \underbrace{P(X_i X_{i1}) }_{\text{transition model}} : \underbrace{P(E_i X_i)}_{\text{sensor model}}$ 
agent maintains belief state of state variables $X_t$ given evidence variables $E_t$

improve accuracy
 increase order of Markov transition model
 increase set of state variables (can be equivalent to 1)
 hard to maintain state variables over time, want more sensors


4 inference problems (here $\cdot$ is elementwise multiplication)

filtering = state estimation  compute $P(X_t e_{1:t})$  recursive estimation: $$\underbrace{P(X_{t+1} e_{1:t+1})}{\text{new state}} = \alpha : \underbrace{P(e{t+1} X_{t+1})}{\text{sensor}} \cdot \underset{x_t}{\sum} : \underbrace{P(X{t+1} x_t)}_{\text{transition}} \cdot \underbrace{P(x_t e_{1:t})}_{\text{old state}}$$ where $\alpha$ normalizes probs 
prediction  compute $P(X_{t+k} e_{1:t})$ for $k>0$


$\underbrace{P(X_{t+k+1} e_{1:t})}{\text{new state}} = \sum{x_{t+k}} \underbrace{P(X_{t+k+1} x_{t+k})}{\text{transition}} \cdot \underbrace{P(x{t+k} e_{1:t})}_{\text{old state}}$

smoothing  compute $P(X_{k} e_{1:t})$ for $0 < k < t$ 
2 components $P(X_k e_{1:t}) = \alpha \underbrace{P(X_k e_{1:k})}{\text{forward}} \cdot \underbrace{P(e{k+1:t} X_k)}_{\text{backward}}$  forward pass: filtering from $1:t$ 2. backward pass from $t:1$ $\underbrace{P(e_{k+1:t}X_k)}{\text{sensor past k}} = \sum{x_{k+1}} \underbrace{P(e_{k+1}x_{k+1})}{\text{sensor}} \cdot \underbrace{P(e{k+2:t}x_{k+1})}{\text{recursive call}} \cdot \underbrace{P(x{k+1}X_k)}_{\text{transition}}$ 3. this is called the forwardbackward algo(also there is a separate algorithm that doesn’t use the observations on the backward pass)


most likely explanation  $\underset{x_{1:t}}{\text{argmax}}:P(x_{1:t} e_{1:t})$ 
Viterbi algorithm: $\underbrace{\underset{x_{1:t}}{\text{max}} : P(x_{1:t}, X_{t+1} e_{1:t+1})}{\text{mle x}} = \alpha : \underbrace{P(e{t+1} X_{t+1})}{\text{sensor}} \cdot \underset{x_t}{\text{max}} \left[ : \underbrace{P(X{t+1} x_t)}{\text{transition}} \cdot \underbrace{\underset{x{1:t1}}{\text{max}} :P(x_{1:t1}, x_{t+1} e_{1:t})}_{\text{max prev state}} \right]$  complexity
 K = number of states
 M = number of observations
 n = length of sequence
 memory  $nK$
 runtime  $O(nK^2)$


learning  form of EM
 basically just count (maximizing joint likelihood of input and output)
 initial state probs $\frac{count(start \to s)}{n}$

$P(x’ x) = \frac{count(s \to s’)}{count(s)}$ 
$P(y x) = \frac{count (x \to y)}{count(x)}$
hmm
 state is a single discrete process
 transitions are all matrices (and no zeros in sensor model)$\implies$ forward pass is invertible so can use constant space
 online smoothing (with lag)
 ex. robot localization
kalman filtering
 state is continuous
 ex.
 type of nodes (realvalued vectors) and prob model (linearGaussian) changes from HMM
 1d example: random walk

state nodes: $x_{t+1} = Ax_t + Gw_t$
 output nodes: $y_t = Cx_t+v_t$
 x is linear Gaussian
 w is noise Gaussian
 y is linear Gaussian
 doing the integral for prediction involves completing the square
 properties
 new mean is weighted mean of new observation and old mean
 update rule for variance is independent of the observation
 variance converges quickly to fixed value that depends only on $\sigma^2_x, \sigma^2_z$
 Lyapunov eqn: evolution of variance of states
 information filter  mathematically the same but different parameterization
 extended Kalman filter
 works on nonlinear systems
 locally linear
 switching Kalman filter  multiple Kalman filters run in parallel and weighted sum of predictions is used
 ex. one for straight flight, one for sharp left turns, one for sharp right turns
 equivalent to adding discrete “maneuver” state variable
general dbns
 can be better to decompose state variable into multiple vars
 reduces size of transition matrix
 transient failure model  allows probability of sensor giving wrong value
 persistent failure model  additional variable describing status of battery meter
 exact inference  våariable elimination mimics recursive filtering
 still difficult
 approximate inference  modification of likelihood weighting
 use samples as approximate representation of current state distr.
 particle filtering  focus set of samples on highprob regions of the state space
 consistent
 sample a state
 sample the next state given the previous state

weight each sample by $P(e_t x_t)$  resample based on weight
structure learning
 conditional correlation  inverse covariance matrix = precision matrix
 estimates only good when $n » p$
 eigenvalues are not wellapproximated
 often enforce sparsity
 ex. threshold each value in the cov matrix (set to 0 unless greater than thresh)  this threshold can depend on different things
 can also use regularization to enforce sparsity
 POET doesn’t assume sparsity