5.2. nlp¶

Some notes on natural language processing, focused on modern improvements based on deep learning.

5.2.1. nlp basics¶

  • basics come from book “Speech and Language Processing”

  • language models - assign probabilities to sequences of words

    • ex. n-gram model - assigns probs to short sequences of words, known as n-grams

      • for a full sentence, use the markov assumption (each word depends only on the previous \(n-1\) words)

    • eval: perplexity (PP) - inverse probability of the test set, normalized by the number of words (want to minimize it; see the sketch after this bullet)

      • \(PP(W_{test}) = P(w_1, ..., w_N)^{-1/N}\)

      • can think of this as the weighted average branching factor of a language

      • should only be compared across models w/ same vocab

    • vocabulary

      • sometimes closed; otherwise there are unknown words, which get assigned their own symbol (e.g. <UNK>)

      • can fix the training vocab, or just choose the top words and mark the rest as unknown
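
A minimal sketch of what an n-gram model and its perplexity look like in code (the tiny corpus, add-one smoothing, and function names here are just for illustration):

```python
import math
from collections import Counter

# toy corpus; real models are estimated from far more text
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# count unigrams and bigrams (n = 2)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def bigram_prob(w_prev, w, k=1.0):
    """P(w | w_prev) with add-k smoothing so unseen bigrams keep nonzero probability."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * V)

def perplexity(test_words):
    """PP = P(w_1 ... w_N)^(-1/N), computed in log space for numerical stability."""
    log_prob = sum(math.log(bigram_prob(prev, w))
                   for prev, w in zip(test_words, test_words[1:]))
    N = len(test_words) - 1  # number of bigram predictions made
    return math.exp(-log_prob / N)

print(perplexity("the dog sat on the mat .".split()))  # lower is better
```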

  • topic models (e.g. LDA) - apply unsupervised learning on large sets of text to learn sets of associated words

  • embeddings - vectors for representing words

    • ex. tf-idf - sparse vectors built from word counts (e.g. counts of nearby words or documents), reweighted so that very common words count less (big + sparse)

      • pointwise mutual info - instead of raw counts, measure whether 2 words co-occur more than we would expect by chance (see the PPMI sketch after this bullet)

    • ex. word2vec - short, dense vectors

      • intuition: train classifier on binary prediction: is word \(w\) likely to show up near this word? (algorithm also called skip-gram)

        • the weights are the embeddings

      • also GloVe, which is based on ratios of word co-occurrence probs
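
A rough sketch of the count-based side: building PPMI (positive pointwise mutual information) vectors from a word-word co-occurrence matrix; the tiny matrix and word list are made up for illustration:

```python
import numpy as np

# made-up co-occurrence counts: C[i, j] = how often word i appears near word j
words = ["cat", "dog", "bone", "purr"]
C = np.array([[0., 2., 0., 5.],
              [2., 0., 6., 0.],
              [0., 6., 0., 0.],
              [5., 0., 0., 0.]])

total = C.sum()
p_ij = C / total                                # joint probability of each pair
p_i = C.sum(axis=1, keepdims=True) / total      # marginal for the target word
p_j = C.sum(axis=0, keepdims=True) / total      # marginal for the context word

with np.errstate(divide="ignore"):
    pmi = np.log2(p_ij / (p_i * p_j))           # > 0 means "co-occur more than chance"
ppmi = np.maximum(pmi, 0)                       # clip negatives (and -inf) to 0

print(ppmi.round(2))  # each row is a sparse-style embedding for one word
```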

  • some tasks

    • tokenization

    • pos tagging

    • named entity recognition

      • nested entity recognition - entities can be nested inside other entities (e.g. not just names, but also phrases like “Jacob’s brother”)

    • sentiment classification

    • language modeling (i.e. text generation)

    • machine translation

    • hardest: coreference resolution

    • question answering

    • natural language inference - does one sentence entail another?

  • most popular datasets

    • (by far) WSJ

    • then twitter

    • then Wikipedia

  • eli5 has nice text highlighting for interp

5.2.2. dl for nlp¶

  • some recent topics based on this blog

  • rnns

    • when training an rnn, accumulate gradients over the whole sequence (backpropagation through time) and then update all at once

    • stacked rnns feed the outputs of one rnn into another rnn

    • bidirectional rnn - one rnn reads left to right and another right to left (their states can be concatenated, added, etc.; see the sketch below)
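
A quick PyTorch sketch of a stacked, bidirectional rnn (all the sizes are arbitrary, just to show the shapes):

```python
import torch
import torch.nn as nn

# 2 stacked layers, each reading the sequence in both directions
rnn = nn.LSTM(input_size=50, hidden_size=64, num_layers=2,
              bidirectional=True, batch_first=True)

x = torch.randn(8, 20, 50)   # (batch, sequence length, embedding dim)
outputs, (h_n, c_n) = rnn(x)

# the two directions' states are concatenated at each position
print(outputs.shape)         # torch.Size([8, 20, 128]) = 2 * hidden_size
```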

  • standard seq2seq

    • encoder reads input and outputs context vector (the hidden state)

    • decoder (rnn) takes this context vector and generates a sequence (see the sketch below)
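
A stripped-down sketch of that encoder/decoder setup in PyTorch, with greedy decoding; the sizes and the start-token convention are made up, and a real implementation would add batching, teacher forcing, and a stop condition:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128                 # made-up sizes

embed = nn.Embedding(VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)
decoder = nn.GRU(EMB, HID, batch_first=True)
out_proj = nn.Linear(HID, VOCAB)                # hidden state -> scores over the vocab

src = torch.randint(0, VOCAB, (1, 10))          # one source sentence as token ids
_, context = encoder(embed(src))                # final hidden state = context vector

# decode one token at a time, starting from a <start> token (id 0 here)
token, hidden, generated = torch.zeros(1, 1, dtype=torch.long), context, []
for _ in range(12):
    out, hidden = decoder(embed(token), hidden)
    token = out_proj(out).argmax(dim=-1)        # greedily pick the most likely next token
    generated.append(token.item())
print(generated)
```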

  • misc papers

5.2.2.1. attention / transformers¶

  • self-attention layer implementation and mathematics

  • **self-attention** - layer that lets each word learn its relation to the other words in the sequence

    • for each word, want a score telling how much importance to place on each other word (queries \(\cdot\) keys); see the sketch after this bullet

    • we get an encoding for each word

      • the encoding of each word is a weighted sum of the values of all the words (the current word often gets the highest weight)

      • softmax the scores and use them as the weights in this weighted sum of values

    • (optional) implementation details

      • multi-headed attention - just like having many filters, get many encodings for each word

        • each one can take input as the embedding from the previous attention layer

      • position vector - add this into the embedding of each word (so words know how far apart they are) - usually use sin/cos rather than actual position number

      • padding mask - sequences are padded to a common length, and the padding positions are masked out so attention ignores them

      • look-ahead mask - might want to mask to only use previous words (e.g. if our final task is decoding)

      • residual + normalize - after self-attention layer, often have residual connection to previous input, which gets added then normalized

    • decoder - each word only allowed to attend to previous positions

    • 3 components

      • queries

      • keys

      • values
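
A bare-bones numpy sketch of a single self-attention head following the \(softmax(QK^T/\sqrt{d_k})V\) recipe (the projection matrices are random here, just to show the shapes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 16, 8            # made-up sizes
X = np.random.randn(seq_len, d_model)       # one embedding per word

# learned projections (random here) map each word to a query, key, and value
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_k)             # how much each word attends to every other word
weights = softmax(scores, axis=-1)          # each row sums to 1
encodings = weights @ V                     # each word's encoding = weighted sum of values

print(weights.shape, encodings.shape)       # (5, 5) (5, 8)
```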

  • attention

    • encoder reads input and outputs a context vector after each word

    • decoder at each step uses a different weighted combination of these context vectors

      • specifically, at each step, the decoder concatenates its hidden state w/ the attention vector (the weighted combination of the context vectors); see the sketch after this bullet

      • this is fed to a feedforward net to output a word

    • at a high level we have \(Q, K, V\) and compute \(softmax(QK^T)V\) (usually scaled: \(softmax(QK^T / \sqrt{d_k})V\))

      • instead could simplify it and do \(softmax(XX^T)V\) - this amounts to using a kernel (similarity function) between the inputs themselves
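
A sketch of one decoder step with this kind of attention over the encoder's per-word context vectors (numpy, dot-product scoring, made-up sizes; the real model learns projections and the feedforward net that follows):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hid = 8
encoder_states = np.random.randn(10, hid)    # one context vector per input word
decoder_state = np.random.randn(hid)         # decoder hidden state at this step

# score each encoder state against the current decoder state (dot-product attention)
scores = encoder_states @ decoder_state
weights = softmax(scores)                    # one weight per input word, summing to 1
attention_vector = weights @ encoder_states  # weighted combination of context vectors

# concatenate with the decoder state; a feedforward net would map this to the next word
combined = np.concatenate([decoder_state, attention_vector])
print(combined.shape)                        # (16,)
```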

  • transformer

    • uses many self-attention layers

    • many stacked layers in the encoder + decoder (no rnn: each layer is self-attention + a feedforward net)

    • details

      • initial encoding: each word -> vector

      • each layer takes a list of fixed size (hyperparameter e.g. length of longest sentence) and outputs a list of that same fixed size (so one output for each word)

        • can easily train by masking a word and predicting it from the encoding at the masked position

    • multi-headed attention has several of each of these (then just concat their outputs); a packaged version is sketched below
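
For reference, PyTorch ships a layer that packages the pieces described above (multi-headed self-attention + feedforward, each with residual + layernorm); the sizes are arbitrary:

```python
import torch
import torch.nn as nn

# one encoder layer: multi-headed self-attention + feedforward, each followed by add & normalize
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=256,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)   # stack several of them

x = torch.randn(2, 10, 64)   # (batch, sequence length, model dim), e.g. word embeddings + positions
out = encoder(x)
print(out.shape)             # fixed-size list in, same-size list out: torch.Size([2, 10, 64])
```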

  • recent papers

  • these ideas are starting to be applied to vision cnns