structure learning view markdown
introduction
 structured prediction  have multiple independent output variables
 output assignments are evaluated jointly
 requires joint (global) inference
 can’t use classifier because output space is combinatorially large
 three steps
 model  pick a model
 learning = training
 inference = testing
 representation learning  picking features
 usually use domain knowledge
 combinatorial  ex. map words to higher dimensions
 hierarchical  ex. first layers of CNN
structure
 structured output can be represented as a graph
 outputs y
 inputs x
 two types of info are useful
 relationships between x and y
 relationships betwen y and y
 complexities
 modeling  how to model?
 train  can’t train separate weight vector for each inference outcome
 inference  can’t enumerate all possible structures
 need to score nodes and edges
 could score nodes and edges independently
 could score each node and its edges together
sequential models
sequence models
 goal: learn distribution $P(x_1,…,x_n)$ for sequences $x_1,…,x_n$
 ex. text generation
 discrete Markov model
 $P(x_1,…,x_n) = \prod_i P(x_i \vert x_{i1})$
 requires
 initial probabilites
 transition matrix
 mth order Markov model  keeps history of previous m states
 each state is an observation
conditional models and local classifiers  discriminative model
 conditional models = discriminative models
 goal: model $P(Y\vert X)$
 learns the decision boundary only
 ignores how data is generated (like generative models)
 ex. loglinear models
 $P(\mathbf{y\vert x,w}) = \frac{exp(w^T \phi (x,y))}{\sum_y’ exp(w^T \phi (x,y’))}$
 training: $w = \underset{w}{argmin} \sum log : P(y_i\vert x_i,w)$
 ex. nextstate model
 $P(\mathbf{y}\vert \mathbf{x})=\prod_i P(y_i\vert y_{i1},x_i)$
 ex. maximum entropy markov model
 $P(y_i\vert y_{i1},x) \propto exp( w^T \phi(x,i,y_i,y_{i1}))$
 adds more things into the feature representation than HMM via $\phi$
 has label bias problem
 if state has fewer next states they get high probability
 effectively ignores x if $P(y_i\vert y_{i1})$ is too high
 if state has fewer next states they get high probability
 $P(y_i\vert y_{i1},x) \propto exp( w^T \phi(x,i,y_i,y_{i1}))$
 ex. conditional random fields=CRF
 a global, undirected graphical model
 divide into factors
 $P(Y\vert x) = \frac{1}{Z} \prod_i exp(w^T \phi (x,y_i,y_{i1}))$
 $Z = \sum_{\hat{y}} \prod_i exp(w^T \phi (x,\hat{y_i},\hat{y}_{i1}))$
 $\phi (x,y) = \sum_i \phi (x,y_i,y_{i1})$
 prediction via Viterbi (with sum instead of product)
 training
 maximize loglikelihood $\underset{W}{max} \frac{\lambda}{2} w^T w + \sum log : P(y_I\vert x_I,w)$
 requires inference
 linearchain CRF  only looks at current and previous labels
 a global, undirected graphical model
 ex. structured perceptron
 HMM is a linear classifier
constrained conditional models
consistency of outputs and the value of inference
 ex. POS tagging  sentence shouldn’t have more than 1 verb
 inference
 a global decision comprising of multiple local decisions and their interdependencies
 local classifiers
 constraints
 a global decision comprising of multiple local decisions and their interdependencies

learning
 global  learn with inference (computationally difficult)
hard constraints and integer programs
soft constraints
inference
 inference constructs the output given the model

goal: find highest scoring state sequence
 $argmax_y : score(y) = argmax_y w^T \phi(x,y)$
 naive: score all and pick max  terribly slow
 viterbi  decompose scores over edges
 questions
 exact v. approximate inference
 exact  search, DP, ILP
 approximate = heuristic  Gibbs sampling, belief propagation, beam search, linear programming relaxations
 randomized v. deterministic
 if run twice, do you get same answer
 exact v. approximate inference
 ILP  integer linear programs
 combinatorial problems can be written as integer linear programs
 many commercial solvers and specialized solvers
 NPhard in general
 special case of linear programming  minimizing/maximizing a linear objective function subject to a finite number of linear constraints (equality or inequality)
 in general, $ c = \underset{c}{argmax}: c^Tx $ subject to $Ax \leq b$
 maybe more constraints like $x \geq 0$
 the constraint matrix defines a polytype
 only the vertices or faces of the polytope can be solutions
 $\implies$ can be solved in polynomial time
 in ILP, each $x_i$ is an integer
 LPrelaxation  drop the integer constraints and hope for the best
 01 ILP  $\mathbf{x} \in {0,1}^n$
 decision variables for each label $z_A = 1$ if output=A, 0 otherwise
 don’t solve multiclass classification with an ILP solver (makes it harder)
 belief propagation
 variable elimination
 fix an ordering of the variables
 iteratively, find the best value given previous neighbors
 use DP
 ex. Viterbi is maxproduct variable elimination
 when there are loops, require approximate solution
 uses message passing to determine marginal probabilities of each variable
 message $m_{ij}(x_j)$ high means node i believes $P(x_j)$ is high
 use beam search  keep sizelimited priority queue of states
 uses message passing to determine marginal probabilities of each variable
 variable elimination
learning protocols
structural svm
 $\underset{w}{min} : \frac{1}{2} w^T w + C \sum_i \underset{y}{max} (w^T \phi (x_i,y)+ \Delta(y,y_i)  w^T \phi(x_i,y_i) )$
empirical risk minimization
 subgradients
 ex. $f(x) = max ( f_1(x), f_2(x))$, solve the max then compute gradient of whichever function is argmax
sgd for structural svm
 highest scoring assignment to some of the output random variables for a given input?
 lossaugmented inference  which structure most violates the margin for a given scoring function?
 adagrad  frequently updated features should get smaller learning rates