structure learning
introduction
- structured prediction - have multiple interdependent output variables
- output assignments are evaluated jointly
- requires joint (global) inference
- can’t use a flat classifier because the output space is combinatorially large
- three steps
- model - pick a model
- learning = training
- inference = testing
- representation learning - picking features
- usually use domain knowledge
- combinatorial - ex. map words to higher-dimensional indicator features (a sketch follows this list)
- hierarchical - ex. first layers of CNN
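A minimal sketch of the combinatorial option: hand-picked indicator features map a word into a higher-dimensional sparse vector (the feature names here are my own illustrative choices):

```python
def word_features(word: str) -> dict:
    """Map a word to sparse indicator features (feature name -> value)."""
    return {
        f"word={word.lower()}": 1.0,           # identity of the word
        f"suffix3={word[-3:].lower()}": 1.0,   # last three characters
        "is_capitalized": float(word[:1].isupper()),
        "is_digit": float(word.isdigit()),
    }

print(word_features("Running"))
# {'word=running': 1.0, 'suffix3=ing': 1.0, 'is_capitalized': 1.0, 'is_digit': 0.0}
```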
structure
- structured output can be represented as a graph
- outputs y
- inputs x
- two types of info are useful
- relationships between x and y
- relationships between y and y
- complexities
- modeling - how to model?
- train - can’t train separate weight vector for each inference outcome
- inference - can’t enumerate all possible structures
- need to score nodes and edges (a toy scoring sketch follows this list)
- could score nodes and edges independently
- could score each node and its edges together
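A toy sketch of the scoring options above (random weights, made-up shapes): the structure’s total score decomposes into per-node scores plus per-edge scores, so each part can be scored separately and summed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels, n_feats = 3, 4
W_node = rng.normal(size=(n_labels, n_feats))   # node weights, one row per label
W_edge = rng.normal(size=(n_labels, n_labels))  # a score for each label pair (edge)

def structure_score(x_feats, y):
    """x_feats: (n, n_feats) input features; y: length-n list of label ids."""
    node = sum(W_node[y[i]] @ x_feats[i] for i in range(len(y)))
    edge = sum(W_edge[y[i - 1], y[i]] for i in range(1, len(y)))
    return node + edge

x = rng.normal(size=(5, n_feats))
print(structure_score(x, [0, 2, 1, 1, 0]))
```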
sequential models
sequence models
- goal: learn a distribution $P(x_1,\ldots,x_n)$ over sequences $x_1,\ldots,x_n$
- ex. text generation
- discrete Markov model
- $P(x_1,\ldots,x_n)=\prod_i P(x_i \mid x_{i-1})$
- requires
- initial probabilities
- transition matrix (both sketched in code after this list)
- $m$th-order Markov model - keeps a history of the previous $m$ states
- each state is an observation
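A minimal sketch of a first-order discrete Markov model over two states (the initial probabilities and transition matrix are made-up numbers):

```python
import numpy as np

pi = np.array([0.6, 0.4])      # initial probabilities P(x_1)
T = np.array([[0.7, 0.3],      # T[i, j] = P(x_t = j | x_{t-1} = i)
              [0.2, 0.8]])

def sequence_prob(xs):
    """P(x_1, ..., x_n) = P(x_1) * prod_i P(x_i | x_{i-1})."""
    p = pi[xs[0]]
    for prev, cur in zip(xs, xs[1:]):
        p *= T[prev, cur]
    return p

print(sequence_prob([0, 0, 1]))   # 0.6 * 0.7 * 0.3 = 0.126
```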
conditional models and local classifiers - discriminative model
- conditional models = discriminative models
- goal: model P(Y|X)
- learns the decision boundary only
- ignores how the data is generated (unlike generative models, which model it)
- ex. log-linear models
- $P(y \mid x, w)=\frac{\exp(w^T\phi(x,y))}{\sum_{y'}\exp(w^T\phi(x,y'))}$
- training: $w=\arg\max_w \sum_i \log P(y_i \mid x_i, w)$ (maximize the log-likelihood)
- ex. next-state model
- $P(y \mid x)=\prod_i P(y_i \mid y_{i-1}, x_i)$
- ex. maximum entropy Markov model (MEMM)
- $P(y_i \mid y_{i-1}, x) \propto \exp(w^T\phi(x, i, y_i, y_{i-1}))$
- packs richer features of the whole input into $\phi$ than an HMM can use
- has label bias problem
- states with fewer outgoing transitions give each successor artificially high probability
- effectively ignores $x$ if $P(y_i \mid y_{i-1})$ is too high
- ex. conditional random fields (CRF)
- a global, undirected graphical model
- divide into factors
- $P(Y \mid x)=\frac{1}{Z}\prod_i \exp(w^T\phi(x, y_i, y_{i-1}))$
- $Z=\sum_{\hat{y}} \prod_i \exp(w^T\phi(x, \hat{y}_i, \hat{y}_{i-1}))$
- $\phi(x,y)=\sum_i \phi(x, y_i, y_{i-1})$
- prediction via Viterbi (summing log-space scores instead of multiplying probabilities; sketched after this list)
- training
- maximize the regularized log-likelihood: $\max_w -\frac{\lambda}{2}w^Tw + \sum_i \log P(y_i \mid x_i, w)$
- requires inference
- linear-chain CRF - only looks at current and previous labels
- ex. structured perceptron
- an HMM is a linear classifier: $\log P(x,y)$ is a weighted sum of transition/emission count features
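A minimal Viterbi decoder for the linear-chain models above (toy dimensions and made-up scores; `emission[i, y]` stands in for $w^T\phi(x, i, y)$ and `transition[y', y]` for the pairwise term):

```python
import numpy as np

def viterbi(emission, transition):
    """emission: (n, L) scores; transition: (L, L) scores. Returns best label sequence."""
    n, L = emission.shape
    delta = np.zeros((n, L))            # best score of a prefix ending in each label
    back = np.zeros((n, L), dtype=int)  # argmax backpointers
    delta[0] = emission[0]
    for i in range(1, n):
        # cand[y_prev, y] = delta[i-1, y_prev] + transition[y_prev, y] + emission[i, y]
        cand = delta[i - 1][:, None] + transition + emission[i][None, :]
        back[i] = cand.argmax(axis=0)
        delta[i] = cand.max(axis=0)
    y = [int(delta[-1].argmax())]
    for i in range(n - 1, 0, -1):       # follow backpointers from the end
        y.append(int(back[i][y[-1]]))
    return y[::-1]

emission = np.array([[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]])
transition = np.array([[-0.5, 0.5], [0.5, -0.5]])
print(viterbi(emission, transition))   # [0, 1, 0]
```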
constrained conditional models
consistency of outputs and the value of inference
- ex. POS tagging - sentence shouldn’t have more than 1 verb
- inference
- a global decision comprising multiple local decisions and their inter-dependencies
- local classifiers
- constraints (a brute-force sketch follows this list)
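A brute-force sketch of constrained inference (made-up scores, hypothetical labels): local classifier scores are combined into a global decision, and a hard constraint ("at most one VERB") rules out the locally-best output:

```python
from itertools import product

labels = ["NOUN", "VERB", "DET"]
# local_scores[i][y] = local classifier's score for label y at position i (made up)
local_scores = [{"NOUN": 0.5, "VERB": 0.9, "DET": 0.1},
                {"NOUN": 0.3, "VERB": 0.8, "DET": 0.2}]

def constrained_argmax(local_scores):
    best, best_score = None, float("-inf")
    for y in product(labels, repeat=len(local_scores)):
        if sum(lab == "VERB" for lab in y) > 1:
            continue   # violates the hard constraint
        s = sum(local_scores[i][lab] for i, lab in enumerate(y))
        if s > best_score:
            best, best_score = y, s
    return best

print(constrained_argmax(local_scores))
# ('NOUN', 'VERB') - the unconstrained best ('VERB', 'VERB') is excluded
```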
learning
- global - learn with inference (computationally difficult)
hard constraints and integer programs
soft constraints
inference
- inference constructs the output given the model
- goal: find the highest-scoring state sequence
- $\arg\max_y \mathrm{score}(y)=\arg\max_y w^T\phi(x,y)$
- naive: score all and pick max - terribly slow
- Viterbi - decompose scores over edges
- questions
- exact v. approximate inference
- exact - search, DP, ILP
- approximate = heuristic - Gibbs sampling, belief propagation, beam search, linear programming relaxations
- randomized v. deterministic
- if run twice, do you get same answer
- ILP - integer linear programs
- combinatorial problems can be written as integer linear programs (a toy encoding follows this list)
- many commercial solvers and specialized solvers
- NP-hard in general
- builds on linear programming (LP) - minimizing/maximizing a linear objective function subject to a finite number of linear constraints (equality or inequality)
- in general, $\max_x c^Tx$ subject to $Ax \le b$
- maybe more constraints, like $x \ge 0$
- the constraint matrix defines a polytope
- only the vertices or faces of the polytope can be solutions
- ⟹ can be solved in polynomial time
- in ILP, each $x_i$ is an integer
- LP-relaxation - drop the integer constraints and hope for the best
- 0-1 ILP - $x \in \{0,1\}^n$
- decision variables for each label: $z_A = 1$ if output $=A$, 0 otherwise
- don’t solve multiclass classification with an ILP solver (makes it harder)
- belief propagation
- variable elimination
- fix an ordering of the variables
- iteratively, find the best value given previous neighbors
- use DP
- ex. Viterbi is max-product variable elimination
- when there are loops, an approximate solution is required
- uses message passing to determine marginal probabilities of each variable
- a high message $m_{ij}(x_j)$ means node $i$ believes $P(x_j)$ is high
- use beam search - keep size-limited priority queue of states
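A toy 0-1 ILP encoding of "pick exactly one label per position" (assumes SciPy >= 1.9 for `scipy.optimize.milp`; the scores are made up, and per the warning above this tiny problem doesn't need an ILP solver - the point is only the encoding):

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

scores = np.array([[0.5, 0.9],   # scores[i, y] for 2 positions, 2 labels
                   [0.4, 0.3]])
n, L = scores.shape
c = -scores.ravel()              # milp minimizes, so negate to maximize

# one constraint per position i: sum_y z[i, y] == 1
A = np.zeros((n, n * L))
for i in range(n):
    A[i, i * L:(i + 1) * L] = 1.0

res = milp(c,
           constraints=LinearConstraint(A, lb=1, ub=1),
           integrality=np.ones(n * L),   # every z is integral...
           bounds=Bounds(0, 1))          # ...and between 0 and 1
print(res.x.reshape(n, L))   # [[0. 1.] [1. 0.]]: label 1 at position 0, label 0 at 1
```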
learning protocols
structural svm
- $\min_w \frac{1}{2}w^Tw + C\sum_i \max_y\left(w^T\phi(x_i,y) + \Delta(y,y_i) - w^T\phi(x_i,y_i)\right)$ (sketched below)
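A minimal sketch of evaluating that objective (toy candidate set, made-up features and loss; real use would replace the enumeration with loss-augmented inference):

```python
import numpy as np

def hinge_term(w, phi, x, y_true, candidates, delta):
    """max_y [w^T phi(x,y) + Delta(y, y_true)] - w^T phi(x, y_true)."""
    worst = max(w @ phi(x, y) + delta(y, y_true) for y in candidates)
    return worst - w @ phi(x, y_true)

def objective(w, data, phi, candidates, delta, C=1.0):
    return 0.5 * (w @ w) + C * sum(
        hinge_term(w, phi, x, y, candidates, delta) for x, y in data)

phi = lambda x, y: np.array([x * y, float(y == 1)])   # made-up joint features
delta = lambda y, yt: float(y != yt)                  # 0-1 structure loss
print(objective(np.zeros(2), [(1.0, 1)], phi, candidates=[0, 1, 2], delta=delta))
# 0 + max(1, 0, 1) - 0 = 1.0
```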
empirical risk minimization
- subgradients
- ex. $f(x)=\max(f_1(x), f_2(x))$ - solve the max, then take the gradient of whichever function achieves it (worked out below)
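Spelling this rule out for the structural SVM objective above (my own bridging step): the inner max is a pointwise max of functions that are linear in $w$, so one subgradient comes from the structure achieving the max:

$$\bar{y} = \arg\max_y \; w^T\phi(x_i, y) + \Delta(y, y_i) \quad\Longrightarrow\quad \phi(x_i, \bar{y}) - \phi(x_i, y_i) \in \partial_w\big(\text{hinge term}\big)$$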
sgd for structural svm
- inference - what is the highest-scoring assignment to some of the output random variables for a given input?
- loss-augmented inference - which structure most violates the margin for a given scoring function? (sketched after this list)
- AdaGrad - frequently updated features should get smaller learning rates
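A minimal sketch of one SGD step (toy candidate set, made-up features; `sgd_step`, `phi`, and `delta` are hypothetical names): loss-augmented inference picks the most-violating structure, then we step along the subgradient from above:

```python
import numpy as np

def sgd_step(w, phi, x, y_true, candidates, delta, lr=0.1, lam=0.01):
    # loss-augmented inference: argmax_y  w^T phi(x, y) + Delta(y, y_true)
    y_hat = max(candidates, key=lambda y: w @ phi(x, y) + delta(y, y_true))
    if w @ phi(x, y_hat) + delta(y_hat, y_true) > w @ phi(x, y_true):
        grad = lam * w + phi(x, y_hat) - phi(x, y_true)   # hinge term is active
    else:
        grad = lam * w                                    # margin satisfied
    return w - lr * grad

phi = lambda x, y: np.array([x * y, float(y == 1)])   # made-up joint features
delta = lambda y, yt: float(y != yt)                  # 0-1 structure loss
w = sgd_step(np.zeros(2), phi, x=1.0, y_true=1, candidates=[0, 1, 2], delta=delta)
print(w)   # [0.1 0.1]
```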