Processing math: 100%

structure learning view markdown


introduction

  1. structured prediction - have multiple independent output variables
    • output assignments are evaluated jointly
    • requires joint (global) inference
    • can’t use classifier because output space is combinatorially large
    • three steps
    1. model - pick a model
    2. learning = training
    3. inference = testing
  2. representation learning - picking features
    • usually use domain knowledge
  3. combinatorial - ex. map words to higher dimensions
  4. hierarchical - ex. first layers of CNN

structure

  • structured output can be represented as a graph
  • outputs y
  • inputs x
  • two types of info are useful
    • relationships between x and y
    • relationships betwen y and y
  • complexities
    1. modeling - how to model?
    2. train - can’t train separate weight vector for each inference outcome
    3. inference - can’t enumerate all possible structures
  • need to score nodes and edges
    • could score nodes and edges independently
    • could score each node and its edges together

sequential models

sequence models

  • goal: learn distribution P(x1,,xn) for sequences x1,,xn
    • ex. text generation
  • discrete Markov model
    • P(x1,,xn)=iP(xi|xi1)
    • requires
      1. initial probabilites
      2. transition matrix
  • mth order Markov model - keeps history of previous m states
  • each state is an observation

conditional models and local classifiers - discriminative model

  • conditional models = discriminative models
    • goal: model P(Y|X)
    • learns the decision boundary only
    • ignores how data is generated (like generative models)
  • ex. log-linear models
    • P(y|x,w)=exp(wTϕ(x,y))yexp(wTϕ(x,y))
    • training: w=argminwlog:P(yi|xi,w)
  • ex. next-state model
    • P(y|x)=iP(yi|yi1,xi)
  • ex. maximum entropy markov model
    • P(yi|yi1,x)exp(wTϕ(x,i,yi,yi1))
      • adds more things into the feature representation than HMM via ϕ
    • has label bias problem
      • if state has fewer next states they get high probability
        • effectively ignores x if P(yi|yi1) is too high
  • ex. conditional random fields=CRF
    • a global, undirected graphical model
      • divide into factors
    • P(Y|x)=1Ziexp(wTϕ(x,yi,yi1))
      • Z=ˆyiexp(wTϕ(x,^yi,ˆyi1))
      • ϕ(x,y)=iϕ(x,yi,yi1)
    • prediction via Viterbi (with sum instead of product)
    • training
      • maximize log-likelihood maxWλ2wTw+log:P(yI|xI,w)
      • requires inference
    • linear-chain CRF - only looks at current and previous labels
  • ex. structured perceptron
    • HMM is a linear classifier

constrained conditional models

consistency of outputs and the value of inference

  • ex. POS tagging - sentence shouldn’t have more than 1 verb
  • inference
    • a global decision comprising of multiple local decisions and their inter-dependencies
      1. local classifiers
      2. constraints
  • learning

    • global - learn with inference (computationally difficult)

hard constraints and integer programs

soft constraints

inference

  • inference constructs the output given the model
  • goal: find highest scoring state sequence

    • argmaxy:score(y)=argmaxywTϕ(x,y)
  • naive: score all and pick max - terribly slow
  • viterbi - decompose scores over edges
  • questions
    1. exact v. approximate inference
      • exact - search, DP, ILP
      • approximate = heuristic - Gibbs sampling, belief propagation, beam search, linear programming relaxations
    2. randomized v. deterministic
      • if run twice, do you get same answer
  • ILP - integer linear programs
    • combinatorial problems can be written as integer linear programs
    • many commercial solvers and specialized solvers
    • NP-hard in general
    • special case of linear programming - minimizing/maximizing a linear objective function subject to a finite number of linear constraints (equality or inequality)
      • in general, c=argmaxc:cTx subject to Axb
      • maybe more constraints like x0
      • the constraint matrix defines a polytype
      • only the vertices or faces of the polytope can be solutions
      • can be solved in polynomial time
    • in ILP, each xi is an integer
    • LP-relaxation - drop the integer constraints and hope for the best
    • 0-1 ILP - x0,1n
    • decision variables for each label zA=1 if output=A, 0 otherwise
    • don’t solve multiclass classification with an ILP solver (makes it harder)
  • belief propagation
    • variable elimination
      1. fix an ordering of the variables
      2. iteratively, find the best value given previous neighbors
        • use DP
        • ex. Viterbi is max-product variable elimination
    • when there are loops, require approximate solution
      • uses message passing to determine marginal probabilities of each variable
        • message mij(xj) high means node i believes P(xj) is high
      • use beam search - keep size-limited priority queue of states

learning protocols

structural svm

  • minw:12wTw+Cimaxy(wTϕ(xi,y)+Δ(y,yi)wTϕ(xi,yi))

empirical risk minimization

  • subgradients
    • ex. f(x)=max(f1(x),f2(x)), solve the max then compute gradient of whichever function is argmax

sgd for structural svm

  • highest scoring assignment to some of the output random variables for a given input?
  • loss-augmented inference - which structure most violates the margin for a given scoring function?
  • adagrad - frequently updated features should get smaller learning rates