3.7. nlp#
See also notes in transformers.
3.7.1. benchmarks#
3.7.1.1. tasks#
Nice repo keeping track of progress here.
tokenization - convert raw text into tokens that can be fed to a model
token/word embeddings - convert tokens to semantic vectors
pos tagging - label each word with its part of speech
named entity recognition - classify tokens based on some known annotations (e.g. person, organization)
nested entity recognition - entities more complex than individual tokens (e.g. "Mary" = "Jacob's brother")
parse tree extraction - extract out tree of how words operate in a grammar
sentiment classification
text summarization
language modeling (i.e. text generation)
(machine) translation
coreference resolution - link mentions that refer to the same entity (e.g. "she" = "Mary", different names for a protein)
question answering
natural language inference - determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise"
topic modeling - unsupervised learning over documents
3.7.1.2. datasets#
open-ended
MMLU: Measuring Massive Multitask Language Understanding (hendrycks, …, song, steinhardt, 2021) - 57 tasks such as elementary math, clinical knowledge, global facts
Natural Instructions: allenai natural-language instructions (v1: mishra et al. 2022; v2: wang et al. 2022) - 1600+ diverse NLP tasks
Definition which describes the task
For each example: (input text, output (varies), explanation*) - explanation only for some tasks
Big-bench (2022) - crowdsourced diverse (but random) NLP tasks
No task definitions or explanations: the "definition" is instead concatenated to each example's input
Unveiling Transformers with LEGO: a synthetic reasoning task (zhang, …, bubeck, …, wagner, 2022)
very popular
SuperGLUE Benchmark (2019)
8 NLU tasks (keeps the two hardest from GLUE, marked with \(^\dagger\)), extending to coreference resolution and question answering
BoolQ (Boolean Questions) - answer a yes/no question about a text passage.
CB (CommitmentBank) - given a text and a clause, predict how much the text commits to the clause
COPA (Choice of Plausible Alternatives) - given a premise sentence and two possible choices, the system must determine which choice is the cause or effect of the premise
MultiRC (Multi-sentence Reading Comprehension) - given a context paragraph, a question about that paragraph, and a list of possible answers, the system must predict which answers are true and which are false
ReCoRD (Reading Comprehension with Commonsense Reasoning Dataset) - multiple-choice QA task (news article + Cloze-style multiple-choice question)
\(^\dagger\)RTE (Recognizing Textual Entailment) - determine if a sentence entails a given hypothesis or not.
WiC (Word-in-Context) - word sense disambiguation task cast as binary classification of sentence pairs. Given two text snippets and a polysemous word that appears in both sentences, the task is to determine whether the word is used with the same sense in both sentences
\(^\dagger\)WSC (Winograd Schema Challenge) - coreference resolution task in which examples consist of a sentence with a pronoun and a list of noun phrases from the sentence. The system must determine the correct referent of the pronoun from among the provided choices.
GLUE Benchmark (2019) = General language understanding evaluation
9 NLU tasks including question answering, sentiment analysis, text similarity and textual entailment
single-sentence tasks
CoLA (Corpus of Linguistic Acceptability) - determine if sentence is grammatically correct
SST-2 (Stanford Sentiment Treebank) - determine if the sentence has a positive or negative sentiment
similarity + paraphrase tasks
MRPC (Microsoft Research Paraphrase Corpus) - determine if two sentences are paraphrases from one another.
QQP (Quora Question Pairs) - determine if two questions are semantically equivalent or not.
STS-B (Semantic Textual Similarity Benchmark) - determine the similarity of two sentences with a score from one to five.
Inference Tasks
MNLI (Multi-Genre Natural Language Inference) - determine if a sentence entails, contradicts, or is unrelated to another sentence
QNLI (Question-answering Natural Language Inference) - determine if the answer to a question is contained in a second sentence or not
RTE (Recognizing Textual Entailment) - determine if a sentence entails a given hypothesis or not
WNLI (Winograd Natural Language Inference) - determine if a sentence with an anonymous pronoun and a sentence with this pronoun replaced are entailed or not
more NLI (natural language inference)
ANLI: Adversarial NLI (nie et al. 2019) - harder examples found by model failures
SNLI Benchmark (bowman et al. 2015) = Stanford Natural Language Inference - entailment dataset
570k sentence pairs human-annotated for entailment, contradiction, or neutrality
FEVER: Fact Extraction and VERification (Thorne et al., 2018)
SciTail (khot et al. 2018) - textual entailment derived from science-question answering
QA
SQuAD 2.0 (Rajpurkar…liang, 2018) - adds 50k unanswerable questions; system must know when it can't answer
SQuAD (Rajpurkar…liang, 2016) - Stanford Question Answering Dataset (SQuAD) - 100k questions from 23k passages in 500 wikipedia articles
Text classification datasets (used in Tree-Prompt and Aug-imodels)
| Dataset | Task | Classes | Example text | Label |
|---|---|---|---|---|
| SST2 | movie review sentiment | 2 | that loves its characters and communicates something rather beautiful about human nature | positive |
| SUBJ | subjective vs objective | 2 | the script isn't very good; not even someone as gifted as hoffman (the actor) can make it work. | subjective |
| MPQA | question answer sentiment | 2 | victory of democracy | positive |
| AGNews | classify news titles | 4 | Wall St. Bears Claw Back Into the Black (Reuters). "Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again." | business |
| CB | given a text and a clause, predict how much the text commits to the clause | 3 | Premise: "Do you mind if I use your phone?" Ronni could see that Guido's brain was whirring. Hypothesis: Guido's brain was whirring | entailment |
| CR | customer review sentiment | 2 | i didn't have any major problems installing this software. | positive |
| DBPedia | categories in wikipedia | 14 | Geoffrey D. Falksen (born July 31 1982) is an American steampunk writer. | artist |
| MR | movie review sentiment | 2 | the film is flat. | negative |
| RTE | entailment | 2 | Sentence 1: No Weapons of Mass Destruction Found in Iraq Yet. Sentence 2: "Weapons of Mass Destruction Found in Iraq." | not_entailment |
| TREC | classifying questions | 6 | What's known as The queen of Drinks? | entity |
| FPB | financial phrase sentiment | 3 | According to Gran, the company has no plans to move all production to Russia, although that is where the company is growing. | neutral |
| IMDB | movie review sentiment | 2 | would put this at the top of my list of films in the category of unwatchable trash! […] | negative |
| Emotion | tweet emotion classification | 6 | i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake | sadness |
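Many of these datasets are available on the Hugging Face hub; a minimal sketch of loading SST2, which ships as the "sst2" subset of GLUE there (hub naming is an assumption of this sketch, not from the papers above):

```python
# load SST2 via the Hugging Face `datasets` library
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])  # e.g. {'sentence': '...', 'label': 0 or 1, 'idx': 0}
```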
common data sources
WSJ
then twitter
then Wikipedia
3.7.1.3. eval metrics#
perplexity (PP) - inverse probability of the test set, normalized by the number of words (want to minimize it)
\(PP(W_{test}) = P(w_1, ..., w_N)^{-1/N}\)
can think of this as the weighted average branching factor of a language
should only be compared across models w/ same vocab
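A minimal sketch of computing perplexity from per-token log-probabilities (the helper name is illustrative, not a standard API):

```python
import math

def perplexity(token_logprobs):
    """PP(W) = P(w_1, ..., w_N)^(-1/N) = exp of the negative mean log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# toy check: if every token has probability 1/4, perplexity is 4 --
# the "weighted average branching factor" interpretation
print(perplexity([math.log(0.25)] * 4))  # 4.0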
BLEU - modified n-gram precision of a candidate text against reference translations (standard for machine translation)
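A hedged sketch using nltk's implementation; bigram weights are chosen here because the toy hypothesis is too short for the default 4-gram BLEU:

```python
# sentence-level BLEU with nltk; weights=(0.5, 0.5) uses only uni- and bigrams
from nltk.translate.bleu_score import sentence_bleu

references = [["the", "dog", "barked", "loudly"]]
hypothesis = ["the", "dog", "barked"]
print(sentence_bleu(references, hypothesis, weights=(0.5, 0.5)))
```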
bert-score: Tiiiger/bert_score
3.7.2. individual tasks#
3.7.2.1. tokenization#
Tokenizers - Hugging Face Course
word-based
punctuation splitting
unknown token [UNK] for anything not seen - to reduce the amount of this, can get character-based tokens
vocab tends to be too big
subword-based - break apart meaningful subparts of words, most popular (2022)
many more (e.g. byte-level BPE, used in GPT-2)
character-based - very little prior, generally sequences are too long
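A minimal sketch of subword tokenization, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (any WordPiece/BPE tokenizer behaves similarly):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# rare words are split into meaningful subwords instead of mapping to [UNK]
print(tokenizer.tokenize("tokenization is surprisingly subtle"))
# e.g. ['token', '##ization', 'is', 'surprisingly', 'subtle']
```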
vocabulary
sometimes closed; otherwise there are unknown words, which get assigned their own symbol
can fix the training vocab, or just choose the top words and treat the rest as unknown
stemming - maps words to a common root, usually by just chopping off the ending (e.g. "dogs" -> "dog")
most common way to do this is Porter's algorithm, which has 5 phases of rule-based word reductions
lemmatization - smarter version that takes into account vocab / morphological analysis (e.g. "am, are, is" -> "be")
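A quick comparison using nltk (the lemmatizer needs the wordnet corpus, downloaded below):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time corpus download

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print(stemmer.stem("dogs"))                  # 'dog'   (chops the ending)
print(stemmer.stem("studies"))               # 'studi' (roots need not be words)
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'    (uses vocab / morphology)
```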
3.7.2.2. token / word embeddings#
embeddings - vectors for representing words
ex. tf-idf - vectors built from weighted word counts (big + sparse)
\(\text{tf-idf}(t, d) = \underbrace{\frac{\text{count}(t, d)}{|d|}}_{\text{TF}} \cdot \underbrace{\log \frac{N}{\text{df}(t)}}_{\text{IDF}}\), where \(N\) is the total number of documents and \(\text{df}(t)\) is the number of documents containing term \(t\)
pointwise mutual info - instead of counts, consider whether 2 words co-occur more than we would have expected by chance
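A minimal tf-idf sketch with scikit-learn (note sklearn smooths the idf term slightly relative to the formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog barked", "the dog slept", "cats sleep all day"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)       # sparse (n_docs, vocab_size) matrix
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))              # common words like "the" get low weight
```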
ex. word2vec - short, dense vectors
intuition: train classifier on binary prediction: is word \(w\) likely to show up near this word? (algorithm also called skip-gram)
the weights are the embeddings
word2vec paper I: initial word2vec (mikolov et al. 2013) - simplifies neural language models for efficient training of word embeddings
maximizing the probabilities of words being predicted by their context words (with a DNN)
continuous bag-of-words (CBOW) - predict current word from window (order doesnât matter)
skipgram - use current word to predict surrounding window; nearby context words weighted more heavily
word2vec paper II: Distributed Representations of Words and Phrases and their Compositionality (mikolov et al. 2013) - word2vec improvements
identify key n-grams and give them their own vectors
ex. GloVe (pennington, socher, & manning, 2014), which is based on ratios of word co-occurrence probs
ex. ELMo (peters…zettlemoyer, 2018) - contextual word embeddings from a bidirectional LSTM
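A minimal sketch of training skip-gram word2vec with gensim (toy corpus; real embeddings need far more text):

```python
from gensim.models import Word2Vec

sentences = [["the", "dog", "barked"],
             ["the", "cat", "meowed"],
             ["dogs", "and", "cats", "are", "pets"]]
# sg=1 selects skip-gram (predict context from current word); sg=0 would be CBOW
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
vec = model.wv["dog"]                    # the learned weights = the embedding
print(model.wv.most_similar("dog", topn=2))
```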
3.7.2.3. language modeling#
language models - assign probabilities to sequences of words
ex. n-gram model - assigns probs to short sequences of words, known as n-grams
for a full sentence, apply the markov assumption: \(P(w_1, \ldots, w_N) \approx \prod_{i=1}^N P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})\)
multi-token decoding for classification - vanilla beam search favors shorter outputs on average, since each step adds a negative log-probability and longer sequences therefore accumulate lower (more negative) scores
smoothing n-gram models - reassign some probability mass to unseen n-grams
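A minimal bigram model with add-one (Laplace) smoothing, illustrating both the markov assumption and smoothing (toy corpus, illustrative helper names):

```python
from collections import Counter

corpus = ["<s> the dog barked </s>", "<s> the cat slept </s>"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p(w, prev, vocab_size=len(unigrams)):
    # markov assumption: P(w | history) ~ P(w | prev); add-one smoothing keeps
    # unseen bigrams from getting zero probability
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

print(p("dog", "the"))   # seen bigram: relatively high
print(p("cat", "dog"))   # unseen bigram: small but nonzero
```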
3.7.2.4. topic modeling#
topic models (e.g. LDA) - apply unsupervised learning on large sets of text to learn sets of associated words
LDA = latent dirichlet allocation
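A minimal LDA sketch with scikit-learn (toy corpus; the two topics should roughly separate pets from finance):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the dog chased the cat", "dogs and cats are pets",
        "stocks fell as markets slid", "investors sold bank stocks"]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixtures
print(doc_topics.round(2))               # unsupervised: no labels were used
```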
3.7.2.5. grammar / parse-tree extraction#
notes/figures from nltk book ch 8/9
language - set of all grammatical sentences
grammar - formal notation that can be used for âgeneratingâ the members of this set
phrases
noun phrases
adjective phrase
structures
constituent structure - words combine with other words to form units which are substitutable
e.g. the fact that we can substitute "He" for "The little bear" indicates that the latter sequence is a unit
coordinate structure: if \(v_{1}\) and \(v_{2}\) are both phrases of grammatical category \(X\), then \(v_{1}\) and \(v_{2}\) is also a phrase of category \(X\)
context-free grammar - set of rewrite rules (productions) that defines such a grammar
e.g. \(S \to NP\; VP\)
rules can have probabilities (a probabilistic CFG), so we search for the most probable parse tree rather than returning all parses
phrase tree
\(\begin{array}{lll}\text { Symbol } & \text { Meaning } & \text { Example } \\ \text { S } & \text { sentence } & \text { the man walked } \\ \text { NP } & \text { noun phrase } & \text { a dog } \\ \text { VP } & \text { verb phrase } & \text { saw a park } \\ \text { PP } & \text { prepositional phrase } & \text { with a telescope } \\ \text { Det } & \text { determiner } & \text { the } \\ \text { N }& \text { noun } & \text { dog } \\ \text { V } & \text { verb } & \text { walked } \\ \text{ P} & \text { preposition } & \text { in }\end{array}\)
note: chunking yields a similar partition of a sequence rather than a tree
algorithms
top-down parsing (e.g. recursive descent parsing) - starts from \(S\) and keeps expanding grammar rules until finding a match
RecursiveDescentParser is unable to handle left-recursive productions of the form X -> X Y
bottom-up parsing - can be faster; find sequences that correspond to the righthand side of a grammar rule and replace them with the left
ex. shift-reduce parser
backtracking ensures we find a parse if one exists (and can find multiple if we choose)
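A small nltk example of parsing with a toy context-free grammar (the grammar itself is illustrative):

```python
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N -> 'man' | 'dog'
    V -> 'saw'
""")
# a chart parser; unlike RecursiveDescentParser it also copes with left recursion
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the man saw a dog".split()):
    print(tree)  # (S (NP (Det the) (N man)) (VP (V saw) (NP (Det a) (N dog))))
```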
dependency grammar focuses on how words relate to other words
relation between head (usually tensed verb) and its dependents (rest of the words)
this can also be displayed as a tree
a dependency graph is projective if, when all the words are written in order, the edges can be drawn above the words without crossing
equivalent to saying that a word and all its descendants (dependents and dependents of its dependents, etc.) form a contiguous sequence of words within the sentence
valencies - sometimes certain members of a class are allowed but others are not
e.g. "Buster was frightened" ✓, but "Buster saw frightened" ✗
need subcategories of things (e.g. intransitive verb vs transitive verb vs dative verb) to know what symbols are allowed in rules
verbs specifically have complements associated with them; unlike modifiers (like the word "really"), complements are not optional, and modifiers are usually not selected in the same way by the head
feature-based grammars - add things to word representations (e.g. plurality) and use these in the grammar rules
3.7.3. language#
Cleaning up the Brickyard: How Theory and Methodology Shape Experiments in Cognitive Neuroscience of Language (burght…zaccarella, 2023) - 6 categories for studying language
semantics - study of the meaning of words / word combinations
syntax - study of the grammatical structure of phrases/sentences
phonetics - study of the physical aspects of sounds
phonology - study of the organization of sounds independent of their physical realization in speech / sign language
morphology - study of the internal structure of words in a language
pragmatics - study of the way in which context during communication contributes to meaning