5.4. structure learning

5.4.1. introduction

  1. structured prediction - have multiple interdependent output variables

  • output assignments are evaluated jointly

  • requires joint (global) inference

  • can’t use a plain (multiclass) classifier because the output space is combinatorially large

  • three steps

    1. model - pick a model

    2. learning = training

    3. inference = testing

  1. representation learning - picking features

  • usually use domain knowledge

  1. combinatorial - ex. map words to higher dimensions

  2. hierarchical - ex. first layers of CNN

5.4.2. structure

  • structured output can be represented as a graph

  • outputs y

  • inputs x

  • two types of info are useful

    • relationships between x and y

    • relationships between y and y

  • complexities

    1. modeling - how to model?

    2. train - can’t train separate weight vector for each inference outcome

    3. inference - can’t enumerate all possible structures

  • need to score nodes and edges

    • could score nodes and edges independently

    • could score each node and its edges together
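
As a sketch of the last point, a structured output can be scored by summing node scores (relating x to each \(y_i\)) and edge scores (relating pairs of outputs); the labels, score tables, and numbers below are made up for illustration:

```python
# toy scoring of a structured output: score(x, y) decomposes into node scores
# (how well each y_i fits its input) and edge scores (how compatible y_i, y_j are)
node_score = {("x1", "A"): 1.0, ("x1", "B"): 0.2,   # score of assigning a label to x_i
              ("x2", "A"): 0.1, ("x2", "B"): 0.9}
edge_score = {("A", "A"): 0.5, ("A", "B"): 0.0,     # score of a pair of adjacent labels
              ("B", "A"): 0.0, ("B", "B"): 0.5}

def score(x, y, edges):
    """Total score = sum of node scores + sum of edge scores over the graph."""
    s = sum(node_score[(xi, yi)] for xi, yi in zip(x, y))
    s += sum(edge_score[(y[i], y[j])] for i, j in edges)
    return s

# chain graph over two nodes, with one edge between positions 0 and 1
print(score(["x1", "x2"], ["A", "B"], edges=[(0, 1)]))   # 1.0 + 0.9 + 0.0 = 1.9
```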

5.4.3. sequential models

sequence models

  • goal: learn distribution \(P(x_1,...,x_n)\) for sequences \(x_1,...,x_n\)

    • ex. text generation

  • discrete Markov model

    • \(P(x_1,...,x_n) = \prod_i P(x_i \vert x_{i-1})\)

    • requires

      1. initial probabilities

      2. transition matrix

  • mth order Markov model - keeps history of previous m states

  • each state is an observation
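
A minimal sketch of the first-order discrete Markov model above; the states, initial probabilities, and transition matrix are made up for illustration:

```python
import numpy as np

states = ["A", "B", "C"]
initial = np.array([0.5, 0.3, 0.2])          # P(x_1)
transition = np.array([[0.6, 0.3, 0.1],      # P(x_i | x_{i-1}), rows = previous state
                       [0.2, 0.5, 0.3],
                       [0.3, 0.3, 0.4]])

def sequence_prob(seq):
    """P(x_1,...,x_n) = P(x_1) * prod_i P(x_i | x_{i-1})."""
    idx = [states.index(s) for s in seq]
    p = initial[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= transition[prev, cur]
    return p

print(sequence_prob(["A", "B", "B", "C"]))   # 0.5 * 0.3 * 0.5 * 0.3 = 0.0225
```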

conditional models and local classifiers

  • conditional models = discriminative models

    • goal: model \(P(Y\vert X)\)

    • learns the decision boundary only

    • ignores how the data is generated (unlike generative models)

  • ex. log-linear models

    • \(P(\mathbf{y}\vert \mathbf{x},w) = \frac{exp(w^T \phi (x,y))}{\sum_{y'} exp(w^T \phi (x,y'))}\)

    • training: \(w = \underset{w}{argmax} \sum_i log \: P(y_i\vert x_i,w)\)

  • ex. next-state model

    • \(P(\mathbf{y}\vert \mathbf{x})=\prod_i P(y_i\vert y_{i-1},x_i)\)

  • ex. maximum entropy markov model

    • \(P(y_i\vert y_{i-1},x) \propto exp( w^T \phi(x,i,y_i,y_{i-1}))\)

      • adds more things into the feature representation than HMM via \(\phi\)

    • has label bias problem

      • if a state has fewer possible next states, those next states get high probability

        • effectively ignores x if \(P(y_i\vert y_{i-1})\) is too high

  • ex. conditional random fields (CRFs)

    • a global, undirected graphical model

      • divide into factors

    • \(P(\mathbf{y}\vert \mathbf{x}) = \frac{1}{Z} \prod_i exp(w^T \phi (x,y_i,y_{i-1}))\)

      • \(Z = \sum_{\hat{y}} \prod_i exp(w^T \phi (x,\hat{y_i},\hat{y}_{i-1}))\)

      • \(\phi (x,y) = \sum_i \phi (x,y_i,y_{i-1})\)

    • prediction via Viterbi (with sum instead of product) - see the Viterbi sketch below

    • training

      • maximize regularized log-likelihood \(\underset{w}{max} \: -\frac{\lambda}{2} w^T w + \sum_i log \: P(y_i\vert x_i,w)\)

      • requires inference

    • linear-chain CRF - only looks at current and previous labels

  • ex. structured perceptron

    • HMM (in log space) is a linear classifier, so perceptron-style updates apply
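
A minimal sketch of Viterbi decoding for a linear-chain model like the CRF above (max-sum dynamic programming over node and edge scores; the toy scores are made up):

```python
import numpy as np

def viterbi(node_scores, edge_scores):
    """node_scores: (n, K) per-position label scores; edge_scores: (K, K)
    transition scores. Returns the highest-scoring label sequence (max-sum)."""
    n, K = node_scores.shape
    best = np.zeros((n, K))             # best[i, k] = best score of a prefix ending in label k
    back = np.zeros((n, K), dtype=int)  # backpointers
    best[0] = node_scores[0]
    for i in range(1, n):
        cand = best[i - 1][:, None] + edge_scores + node_scores[i][None, :]
        back[i] = cand.argmax(axis=0)
        best[i] = cand.max(axis=0)
    y = [int(best[-1].argmax())]        # follow backpointers from the best final label
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]

# toy run: 3 positions, 2 labels
print(viterbi(np.array([[2., 0.], [0., 1.], [1., 0.]]),
              np.array([[1., 0.], [0., 1.]])))   # -> [0, 0, 0]
```

And a sketch of one structured perceptron update; `phi` (a joint feature map) and `argmax_inference` (e.g. the Viterbi routine above) are assumed helpers, not something defined in these notes:

```python
def structured_perceptron_step(w, x, y_gold, phi, argmax_inference, lr=1.0):
    """Predict the best structure under the current weights, then move w toward
    the gold structure's features and away from the predicted structure's."""
    y_pred = argmax_inference(w, x)
    if y_pred != y_gold:
        w = w + lr * (phi(x, y_gold) - phi(x, y_pred))
    return w
```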

5.4.4. constrained conditional models

consistency of outputs and the value of inference

  • ex. POS tagging - a sentence shouldn’t have more than 1 verb (see the constrained-decoding sketch at the end of this section)

  • inference

    • a global decision comprising multiple local decisions and their inter-dependencies

    1. local classifiers

    2. constraints

  • learning

    • global - learn with inference (computationally difficult)

    • constraints can be hard (encoded as integer linear programs) or soft
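
A toy sketch of inference with a hard constraint, in the spirit of the POS example above (made-up words, tags, and scores; brute-force enumeration only works for tiny outputs, and in practice the constrained argmax is solved with ILP or search, as in the next section):

```python
from itertools import product

tags = ["NOUN", "VERB", "DET"]
local_scores = [                                # per-word tag scores from a local classifier
    {"NOUN": 0.1, "VERB": 0.7, "DET": 0.2},     # e.g. "time"
    {"NOUN": 0.4, "VERB": 0.5, "DET": 0.1},     # e.g. "flies"
]

def satisfies_constraints(y):
    return sum(t == "VERB" for t in y) <= 1     # hard constraint: at most one verb

# constrained inference: best-scoring tag sequence among those meeting the constraint
best = max(
    (y for y in product(tags, repeat=len(local_scores)) if satisfies_constraints(y)),
    key=lambda y: sum(local_scores[i][t] for i, t in enumerate(y)),
)
print(best)   # ('VERB', 'NOUN'): the unconstrained best ('VERB', 'VERB') is ruled out
```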

5.4.5. inference

  • inference constructs the output given the model

  • goal: find highest scoring state sequence

    • \(argmax_y \: score(y) = argmax_y w^T \phi(x,y)\)

  • naive: score all and pick max - terribly slow

  • Viterbi - decompose scores over edges

  • questions

    1. exact v. approximate inference

    • exact - search, DP, ILP

    • approximate = heuristic - Gibbs sampling, belief propagation, beam search, linear programming relaxations

    1. randomized v. deterministic

    • if run twice, do you get same answer

  • ILP - integer linear programs

    • combinatorial problems can be written as integer linear programs

    • many commercial solvers and specialized solvers

    • NP-hard in general

    • an extension of linear programming - minimizing/maximizing a linear objective function subject to a finite number of linear constraints (equality or inequality)

      • in general, \( x^* = \underset{x}{argmax}\: c^Tx \) subject to \(Ax \leq b\)

      • maybe more constraints like \(x \geq 0\)

      • the constraint matrix defines a polytope

      • only the vertices or faces of the polytope can be solutions

      • \(\implies\) can be solved in polynomial time

    • in ILP, each \(x_i\) is an integer

    • LP-relaxation - drop the integer constraints and hope for the best (see the sketch at the end of this section)

    • 0-1 ILP - \(\mathbf{x} \in \{0,1\}^n\)

    • decision variables for each label \(z_A = 1\) if output=A, 0 otherwise

    • don’t solve multiclass classification with an ILP solver (makes it harder)

  • belief propagation

    • variable elimination

      1. fix an ordering of the variables

      2. iteratively, find the best value given previous neighbors

      • use DP

      • ex. Viterbi is max-product variable elimination

    • when the graph has loops, an approximate solution is required

      • uses message passing to determine marginal probabilities of each variable

        • message \(m_{ij}(x_j)\) high means node i believes \(P(x_j)\) is high

      • use beam search - keep size-limited priority queue of states
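
A sketch of the LP relaxation mentioned above on a tiny labeling problem, using scipy.optimize.linprog as an off-the-shelf LP solver (scores are made up; the 0-1 indicator variables are relaxed to the interval [0, 1]):

```python
import numpy as np
from scipy.optimize import linprog

scores = np.array([[2.0, 0.5],        # scores[i, k] = score of label k at position i
                   [0.3, 1.0]])
n, K = scores.shape
c = -scores.flatten()                 # linprog minimizes, so negate to maximize
A_eq = np.zeros((n, n * K))
for i in range(n):
    A_eq[i, i * K:(i + 1) * K] = 1.0  # each position picks exactly one label: sum_k z[i,k] = 1
res = linprog(c, A_eq=A_eq, b_eq=np.ones(n), bounds=(0, 1))   # z relaxed to [0, 1]
print(res.x.reshape(n, K))            # here the relaxation happens to come out integral
```

And a minimal beam search sketch for approximate max-sum decoding, keeping only a size-limited set of partial label sequences at each step:

```python
def beam_search(node_scores, edge_scores, beam_size=2):
    """Approximate decoding: keep only the `beam_size` best partial sequences."""
    n, K = len(node_scores), len(node_scores[0])
    beam = sorted([([k], node_scores[0][k]) for k in range(K)],
                  key=lambda p: -p[1])[:beam_size]
    for i in range(1, n):
        candidates = [(seq + [k], s + edge_scores[seq[-1]][k] + node_scores[i][k])
                      for seq, s in beam for k in range(K)]
        beam = sorted(candidates, key=lambda p: -p[1])[:beam_size]
    return beam[0]   # (best sequence found, its score)
```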

5.4.6. learning protocols

structural svm

  • \(\underset{w}{min} \: \frac{1}{2} w^T w + C \sum_i \underset{y}{max} (w^T \phi (x_i,y)+ \Delta(y,y_i) - w^T \phi(x_i,y_i) )\)

empirical risk minimization

  • subgradients

    • ex. \(f(x) = max ( f_1(x), f_2(x))\) - take the max first, then compute the gradient of whichever function attains it

sgd for structural svm

  • inference - what is the highest scoring assignment to (some of) the output random variables for a given input?

  • loss-augmented inference - which structure most violates the margin for a given scoring function?

  • adagrad - frequently updated features should get smaller learning rates
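
A sketch of one SGD (subgradient) step on the structural SVM objective above, plus an AdaGrad-style per-feature step size; `phi` and `loss_aug_argmax` are assumed helpers, and `lam` plays the role of the regularization trade-off:

```python
import numpy as np

def sgd_structural_svm_step(w, x, y_gold, phi, loss_aug_argmax, lr=0.1, lam=0.01):
    """loss_aug_argmax solves loss-augmented inference:
        y_hat = argmax_y [ w . phi(x, y) + Delta(y, y_gold) ]
    and the subgradient of the max term is phi(x, y_hat) - phi(x, y_gold)."""
    y_hat = loss_aug_argmax(w, x, y_gold)
    subgrad = lam * w + (phi(x, y_hat) - phi(x, y_gold))
    return w - lr * subgrad

def adagrad_update(w, grad, hist, lr=0.1, eps=1e-8):
    """AdaGrad: features with large accumulated squared gradients (frequently
    updated ones) get smaller effective learning rates."""
    hist = hist + grad ** 2
    return w - lr * grad / (np.sqrt(hist) + eps), hist
```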