transformers

# papers

## high-performing

• early papers
• attention is all you need (vaswani et al. 2017) - initial transformer
• encoder-decoder transformer for seq-to-seq (most new models don’t have special encoder-decoder structure for translation)
• Semi-supervised Sequence Learning (dai & quoc le, 2015)
• context vector is weighted sum of context vector at each word
• ULMFiT (howard & ruder, 2018)
• BERT (devlin et al. 2018) - semi-supervised learning (predict masked word - this is bidirectional) + supervised finetuning
• roberta (liu et al. 2019)
• BART (lewis et al. 2019) - generalizes BERT with sequence-to-sequence training: train by (1) corrupting text then (2) reconstructing the original text
• ELMo (peters…zettlemoyer, 2018) - no word embeddings - train embeddings w/ bidirectional lstm (on language modeling)
• XLNet (yang…quoc le, 2019)
• GPT-3 (brown et al. 2020) - identical to GPT-2 except larger and alternates dense attention with locally banded sparse attention
• Longformer: The Long-Document Transformer (beltagy, peters, & cohan, 2020) - processes very long contexts
• PaLM: Scaling Language Modeling with Pathways (google, 2022) - 540 Billion params
• pathways hardware center allows for fast/efficient training
• discontinuous improvements - at some point large model improves
• prompt engineering: “Explain yourself” - lets it explain jokes
• Chinchilla: Training Compute-Optimal Large Language Models (deepmind, 2022)
• “chinchilla scaling laws” - for compute-optimal training, the model size and the number of training tokens should be scaled equally
• T0 (sanh…rush, 2022) - multitask training enables better zero-shot generalization
• more effective training
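A rough sketch of what "scaled equally" means for the Chinchilla entry above. It assumes the common rule of thumb that training compute is about $6ND$ FLOPs for $N$ parameters and $D$ tokens; that approximation, the function names, and the numbers are assumptions for illustration, not taken from these notes.

```python
import math

def approx_flops(n_params, n_tokens):
    # Assumed rule of thumb: roughly 6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

def scale_equally(n_params, n_tokens, compute_multiplier):
    """Grow params and tokens by the same factor so 6*N*D grows by compute_multiplier."""
    factor = math.sqrt(compute_multiplier)
    return n_params * factor, n_tokens * factor

# Illustrative numbers only: 4x the compute -> 2x the params and 2x the tokens.
base_n, base_d = 1e9, 20e9
n, d = scale_equally(base_n, base_d, compute_multiplier=4)
```

Under this approximation, both model size and token count grow with the square root of the compute budget.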

other

## external knowledge / tool use / grounding

These are transformer-specific. For more general notes, see 📌 transfer learning or 📌 uncertainty. Most of these approaches can be combined with metalearning.

• finetuning
• finetune all DNN params
• finetune linear layer on activations
• standard - train linear model on the embedding of the first token (usually an added [CLS] token) (peters et al. 2018)
• finetune linear model on all the activations
• e.g. evci, et al. 2022 - learn linear layer (using group-lasso) on features extracted from all layers
• finetune specific DNN params (e.g. just the bias terms)
• adapter - finetune lightweight layers on top of pre-trained layers (between finetuning all layers, and just finetuning a new layer)
• ablate some model weights by training a binary mask over model parameters (Zhao et al., 2020; Radiya-Dixit and Wang, 2020)
• prompting = few-shot learning = priming = in-context learning (starts with GPT)
• prompting without changing any model parameters
• limitation: can’t exploit sets longer than the training window
• MetaICL: Learning to Learn In Context (min et al. 2022) - tune LLM to do in-context learning on a large set of training tasks (few-shot prompting at training time and at test time)
• Visual Prompting via Image Inpainting (bar…darrell, globerson, efros, 2022)
• Pattern-Exploiting Training (PET) – Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference (schick & schutze, 2021)
• cloze questions - same as masked language modeling: task is to replace some missing words
• use cloze-question templates (e.g. it was “good” or “bad”) to get soft labels for unlabeled data and then finetune on these
• prompt-tuning (also see next section on autoprompting)

mt-dnn line of work

• Multi-Task Deep Neural Networks for Natural Language Understanding (xiaodong liu … gao 2019) - multi-task learning on the 9 glue tasks (first layers are shared, then some task-specific layers at top)
• RAdam: On the Variance of the Adaptive Learning Rate and Beyond (liyuan liu…gao, han, 2020)
• usually need to do learning-rate warmup when training (e.g. with Adam)
• SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization (jiang…gao, zhao, 2020)
1. Smoothness-inducing regularization, which effectively manages the complexity of the model
2. Bregman proximal point optimization to prevent aggressive updating
• Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding (xiaodong liu…gao, 2020)
• Posterior Differential Regularization with f-divergence for Improving Model Robustness (hao cheng, …, gao 2021)
• regularize model posterior difference between clean + noisy inputs (e.g. adversarially attacked inputs)

## pruning

• SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot (frantar & alistarh, 2023) - prune GPT-style models to at least 50% sparsity in one shot, without any retraining, at minimal loss of accuracy
• Cramming: Training a Language Model on a Single GPU in One Day (geiping & goldstein, 2022) - tricks for training BERT

# prompting

• Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing (liu…neubig, 2021)
• from feature-engineering -> architecture engineering -> prompt engineering
• LAMA Language Models as Knowledge Bases? (petroni…riedel, 2019) - Proposes using fill-in-the-blank (cloze) prompts for extracting knowledge from large language models
• create LAMA probe - dataset of (subject, relation, object) triplets with templates – find that BERT can recall these relations
• How to Query Language Models? (adolphs et al. 2021) - query LLMs by example (e.g. “Ronaldo plays for Portugal. Who does Neuer play for?”)
• How Can We Know What Language Models Know? (jiang … neubig, 2020)
• mining-based and paraphrasing-based methods to automatically generate high-quality diverse prompts
• ensemble methods to combine answers from different prompts (e.g. avg logits and more)
• Noisy Channel Language Model Prompting for Few-Shot Text Classification (min et al. 2022)
• Querying $P(\text{question} \mid \text{answer})$ with Bayes rule outperforms standard querying $P(\text{answer} \mid \text{question})$
• memory-assisted prompt-editing (madaan…yang, 2022) - allows model to “save things to memory” that get added to prompt when needed
• Prompting Is Programming: A Query Language For Large Language Models (Beurer-Kellner, Fischer, & Vechev, 2022)
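For the noisy-channel entry above, the Bayes-rule step can be written out. This is a sketch in which $x$ is the question (input) and $y$ a candidate answer; the uniform prior in the last step is an assumption made here for illustration.

$$
\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y \frac{P(x \mid y)\, P(y)}{P(x)} = \arg\max_y P(x \mid y)\, P(y)
$$

With a uniform prior over candidates this reduces to $\arg\max_y P(x \mid y)$, i.e. score how well each answer explains the question. The two forms agree for exact probabilities; they can differ in practice because the LM's estimates of $P(y \mid x)$ and $P(x \mid y)$ are separate approximations.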

## llm chaining / decoding

many notes are from this thread on chaining models together

• steering
• overviews
• Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts (wu, terry, & cai, 2022) - chaining LLM steps together: output of one step becomes the input for the next
• interactive system where users can modify chains + their intermediate results – improves performance + human experience
• Language Model Cascades (dohan…sutton, 2022) - treat chaining models as probabilistic programs
• use a probabilistic-programming language (PPL) to define a joint probability model on string-valued random variables, parameterized using LMs, and then condition this model on string-valued observations in order to compute a posterior over string-valued unknowns
• self-PPLs extend probabilistic graphical models to support more complex joint distributions whose size and “shape” can itself be stochastic
• e.g., a graph unrolled for a random number of iterations, until a data-dependent stopping criterion is met
• variables are all text: questions $Q$, answers $A$, and intermediate thoughts $T$
• posthoc
• Chain of Thought Prompting (wei et al. 2022)
• in few-shot prompts, don’t just provide answer but also reasoning
• model output then provides reasoning + answer
• Self-Consistency Improves Chain of Thought Reasoning in Language Models (wang, wei, schuurmans, quoc le, … zhou, 2022) - sample a diverse set of reasoning paths from a language model via chain of thought prompting then return the most consistent final answer in the set
• Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (suzgun, …, quoc le, …, jason wei, 2022)
• selection inference (creswell et al. 2022) - generate set of facts, then iteratively generate inferences from the facts to yield the final answer
• least-to-most prompting (zhou…quoc le et al. 2022) - prompt LLM with context showing how to reduce into subproblems; then LLM sequentially solves the subproblems, using the previous answers
• Generated Knowledge Prompting for Commonsense Reasoning (liu…hajishirzi, 2021) - generate knowledge from an LLM then provide it as additional input when answering a question
• maieutic prompting (jung et al. 2022) - generate a tree of all explanations of the form “True, because…”, “False, because…” then query LLM with these as prompts
• then use Max-SAT to try to satisfy as many relations between the model explanations as possible to come up with the true answer
• training
• verifiers (cobbe et al. 2021) - train model to judge whether an answer and thought are likely to be “valid”
• subgoal search (czechowski et al. 2021) - train model to generate subgoals then solve them in a graph
• STaR “Self-taught reasoner” (zelikman…goodman, 2022)
• first, finetune on observed $(Q, T, A)$ triplets
• then, impute unknown $T_i$ given dataset of pairs $(Q_i, A_i)$ by sampling until finding a $T_i$ which leads to the correct answer
• robotics-specific
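The self-consistency idea from the posthoc list above (sample several reasoning paths, return the most common final answer) can be sketched as a majority vote. The sampler here is a hypothetical stand-in that replays canned outputs; a real setup would sample an LLM with temperature > 0.

```python
from collections import Counter

def self_consistent_answer(sample_fn, question, n_samples=5):
    """Sample (reasoning, answer) pairs and return the most common answer and its vote share."""
    answers = [sample_fn(question)[1] for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples

# Hypothetical stand-in for a sampled LLM: replays canned (reasoning, answer) pairs.
_canned = iter([("3+4=7", "7"), ("3+4=8", "8"), ("4+3=7", "7"), ("count up: 7", "7"), ("3+5=8", "8")])

def fake_sampler(question):
    return next(_canned)

answer, agreement = self_consistent_answer(fake_sampler, "What is 3 + 4?")
```

Only the final answers are compared; the reasoning strings are discarded, so two different chains that reach the same answer count as agreeing.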

# misc

## direct weight inspection

• nice paper list here

• all layers are same dimension and each attention block adds a vector to it
• Although they’re parameterized as separate matrices, $W_O W_V$ and $W_Q^T W_K$ can always be thought of as individual, low-rank matrices
• $x \in \mathbb R^{d_{embed} \times d_{sequence}}$: $d_{embed}$ can be hundreds - tens of thousands
• $W_Q, W_K, W_V \in \mathbb R^{d_{attn} \times d_{embed}}$
• $W_Q^TW_K \in \mathbb R ^{d_{embed} \times d_{embed}}$
• $W_O \in \mathbb R^{d_{embed} \times d_{attn}}$: projects attention values back to embedding dimension
• $W_O W_V \in \mathbb R ^{d_{embed} \times d_{embed}}$
• $W_E \in \mathbb R^{d_{embed} \times d_{vocab}}$ embeds initial tokens and $W_U \in \mathbb R^{d_{vocab} \times d_{embed}}$ undoes the embedding
• $d_{vocab}$ can be very large, e.g. 50k
• $A = \text{softmax}(x^TW_Q^TW_Kx) \in \mathbb R^{d_{sequence} \times d_{sequence}}$
• if we have a 0-layer net (e.g. predict next token with linear layer given current token), we just learn bigram log-likelihood
• 2 circuits
• QK circuit determines which “source” token the present “destination” token attends back to and copies information from
• $W_{E}^{T} W_{Q}^{T} W_{K} W_{E} \in \mathbb R ^{d_{vocab} \times d_{vocab}}$
• OV circuit describes what the resulting effect on the “out” predictions for the next token is
• $W_{U} W_{O} W_{V} W_{E} \in \mathbb R ^{d_{vocab} \times d_{vocab}}$
• if a single head increases the probability of both keep… in mind and keep… at bay, it must also increase the probability of keep… in bay and keep… at mind
• induction heads search previous examples of present token
• If they don’t find it, they attend to the first token and do nothing
• if they do find it, they then look at the next token and copy it. This allows them to repeat previous sequences of tokens, both exactly and approximately
• sometimes can do some kind of “fuzzy” matching
• tensor/kronecker product $\otimes$:
• Left-right multiplying: Multiplying $x$ by a tensor product $A \otimes W$ is equivalent to simultaneously left and right multiplying: $(A \otimes W) x=A x W^{T}$
• When we add them, it is equivalent to adding the results of this multiplication: $\left(A_{1} \otimes W_{1}+A_{2} \otimes W_{2}\right) x=A_{1} x W_{1}^{T}+A_{2} x W_{2}^{T}$
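A small numerical check of the left-right multiplying identity and of the low-rank claim for $W_O W_V$. This is a sketch under assumptions: $X$ is laid out as tokens × embedding so that $A$ acts on the left, and flattening is row-major (NumPy's default); the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_seq, d_embed, d_attn = 5, 8, 2          # arbitrary small sizes for the check

A = rng.normal(size=(d_seq, d_seq))       # stands in for an attention pattern
W_V = rng.normal(size=(d_attn, d_embed))
W_O = rng.normal(size=(d_embed, d_attn))
W = W_O @ W_V                             # (d_embed, d_embed), rank at most d_attn
X = rng.normal(size=(d_seq, d_embed))     # assumption: one row per token

lhs = np.kron(A, W) @ X.reshape(-1)       # (A ⊗ W) applied to row-major flattened X
rhs = (A @ X @ W.T).reshape(-1)           # left-multiply by A, right-multiply by W^T

identity_holds = np.allclose(lhs, rhs)
rank_of_W = np.linalg.matrix_rank(W)
```

With a column-major flattening the roles of the two Kronecker factors swap, so the layout choice matters when checking this by hand.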

Softmax Linear Units

• replacing activation function with softmax linear unit increases fraction of MLP neurons which are “interpretable”, i.e. correspond to meaningful features
• however, may “hide” some non-neuron-aligned features by decreasing their magnitude and then later recovering it with LayerNorm
• the presence of nonlinear activation functions creates an incentive for features to align with this basis and not get superposed
• if the gains to sparse coding are large enough, this incentive will get overwhelmed
• ways to combat polysemanticity
• activation sparsity
• lateral inhibition / co-occurrence sparsity
• weight sparsity
• superlinear activation functions
• increase neurons per param
• $\text{SoLU}(x) = x \cdot \text{softmax}(x)$
• adds lateral inhibition, superlinearity, approximate sparsity
• replaces GeLU, which is approximately $\text{sigmoid}(1.7x) \cdot x$
• just changing to SoLU decreases performance, so LayerNorm had to be added afterwards
• Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors (yun, chen, olshausen, lecun, 2021) - investigate LLM embeddings of different words using dictionary learning
• LLMs produce interesting contextualized word embeddings
• dictionary elements (of activations across layers) correspond to meaningful things
• A Circuit for Indirect Object Identification in GPT-2 small (wang, …, steinhardt, 2022)
• explanation encompasses 26 attention heads grouped into 7 main classes
• task: indirect object identification - “When Mary and John went to the store, John gave a drink to ___” should be “Mary”
• circuit
• identify all previous names
• remove duplicated names
• output remaining name
• Finding Skill Neurons in Pre-trained Transformer-based Language Models - some individual neurons are predictive of the final task (dubbed “skill neurons”)
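Going back to the SoLU formula in the notes above, $\text{SoLU}(x) = x \cdot \text{softmax}(x)$, here is a minimal NumPy sketch next to the GeLU approximation $\text{sigmoid}(1.7x) \cdot x$. The input vector is made up, and the LayerNorm that the notes say has to follow SoLU is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def solu(x, axis=-1):
    # Each unit is scaled by its softmax share, so a large unit suppresses the others.
    return x * softmax(x, axis=axis)

def gelu_approx(x):
    # x * sigmoid(1.7 x), the approximation quoted in the notes
    return x / (1.0 + np.exp(-1.7 * x))

pre = np.array([4.0, 1.0, 0.5, -1.0])          # made-up pre-activations
out_solu = solu(pre)
out_gelu = gelu_approx(pre)
```

Comparing the two outputs shows the lateral inhibition: the smaller positive units are pushed much closer to zero by SoLU than by the elementwise GeLU, because their softmax share is small.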

## editing

• Locating and Editing Factual Associations in GPT (meng, bau et al. 2022)
• localize factual associations - causal intervention for identifying neuron activations that are decisive in a model’s factual predictions
• “causal traces” - run net multiple times, introducing corruptions and then restore states from original non-corrupted forward pass to see which states can restore the original results
• a small number of states contain info that can flip the model from one state to another
• change factual associations - modify feedforward weights to update specific factual associations using Rank-One Model Editing (ROME)
• Mass Editing Memory in a Transformer (meng…, bau, 2022)
• Knowledge Neurons in Pretrained Transformers (dai et al. 2021) - integrated gradients wrt each neuron in BERT
• Memory-Based Model Editing at Scale (mitchell…manning, finn, 2022)
• keep track of list of edits in external memory and use them as appropriate context at test time (don’t finetune the model)
• Fast model editing at scale (mitchell…finn, manning, 2022)
• a collection of small auxiliary editing networks that use a single desired input-output pair to edit a pre-trained model
• MEND learns to transform the gradient obtained by standard fine-tuning, using a low-rank decomposition of the gradient
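To make the "rank-one" part of Rank-One Model Editing concrete, here is a deliberately simplified sketch: a plain rank-one update that makes a linear layer map one key vector to a new value while leaving directions orthogonal to that key untouched. This is not the ROME update itself; it assumes an identity matrix where the paper uses statistics over other keys, and all names and sizes are hypothetical.

```python
import numpy as np

def rank_one_edit(W, key, new_value):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' @ key == new_value."""
    residual = new_value - W @ key
    return W + np.outer(residual, key) / (key @ key)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))          # hypothetical feedforward weight
key = rng.normal(size=4)             # hypothetical key for the fact being edited
new_value = rng.normal(size=6)       # hypothetical value encoding the new fact

other = rng.normal(size=4)
other -= (other @ key) / (key @ key) * key   # make this input orthogonal to the key

W_edited = rank_one_edit(W, key, new_value)
```

Inputs that overlap with the key do change under this update, which is one reason the simplification above should not be read as the paper's method.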

## symbolic reasoning

• GPT-3 Large Language Models are Zero-Shot Reasoners - simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3

• Compositional processing emerges in neural networks solving math problems (russin, roland fernandez, …, smolensky, gao, 2021)

• neurocompositional computing (smolensky…gao, 2022)
• longer tutorial (smolensky, …, gao, 2022)

• central paradox of cognition: the brain uses continuous neural symbols yet is also compositional (smolensky et al. 1992)
• Compositionality
• Continuity - the encoding and processing of information is formalized with real numbers that vary continuously
• 3 challenges
• compositional generalization
• data efficiency
• comprehensibility
• solution - NECST: Neurally-Encoded Compositionally-Structured Tensor computing (smolensky & legendre, 2006) - basically leverages TPR
• TPR roles and fillers can both be made continuous
• neural space vs symbolic space (many different things (e.g. sentences) can mean the same thing)
• word vectors can be thought of as “soft symbols”
• want to move from symbolic repr. to neural repr. while keeping interpretability
• thinking fast (system 1: fast, intuitive) + slow (system 2: slower, logical, derivative)
• concrete proposals
• transformer activation vector should encode graph of flow through the network
• ex. task: regurgitate a sequence
• TPR: Tensor product variable binding and the representation of symbolic structures in connectionist systems (paul smolensky, 1990) - activation patterns are “symbols” and internal structure allows them to be processed like symbols
• tensor product representation = TPR
• TPR slides
• TPR of a structure is the sum of the TPR of its constituents
• tensor product operation allows constituents to be uniquely identified, even after the sum (if roles are linearly independent)
• filler - one vector that embeds the content of the constituent
• role - second vector that embeds the structural role it fills
• NECSTransformer: Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving (schlag, …, gao, 2019)
• TP-attention
• beat SOA on free-form math word-problems
• do element-wise multiplication of outputted vector with role-vector
• TPR built as tensor product of 2 vectors:
• filler - the vector returned by attention
• ex. one head learns “second-argument-of”
• role - a relation conceptually labeling an edge of the attention graph
• TP-N2F: Tensor Product Representation for Natural To Formal Language Generation - Microsoft Research (chen…gao, 2019)
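The TPR notes above (a structure's TPR is the sum of filler ⊗ role terms, and constituents stay identifiable when roles are linearly independent) can be checked numerically. This is a sketch with random vectors; unbinding through the pseudo-inverse of the role matrix is one standard way to build the dual role vectors and is a choice made here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_filler, d_role, n = 4, 3, 3

fillers = rng.normal(size=(n, d_filler))   # row i: content of constituent i
roles = rng.normal(size=(n, d_role))       # row i: structural role of constituent i
                                           # (random rows, linearly independent here)

# TPR of the structure = sum over constituents of filler_i ⊗ role_i
tpr = sum(np.outer(f, r) for f, r in zip(fillers, roles))   # (d_filler, d_role)

# Dual role vectors u_j satisfy roles[i] @ u_j = 1 if i == j else 0.
duals = np.linalg.pinv(roles)              # (d_role, n), column j is u_j
recovered = tpr @ duals                    # column j is the filler bound to role j
```

If the roles were linearly dependent, `roles @ duals` would no longer be the identity and the recovered fillers would be mixtures of the originals.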

## sparse experts / ensembles / mixture of experts (MoE)

mixture of experts models have become popular because of the need for (1) fast speed / low memory at test time while still (2) having a large model during training

• note: nowadays often the “experts” are different MLPs following the self-attention layers
• A Review of Sparse Expert Models in Deep Learning (fedus, jeff dean, zoph, 2022)
• sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models
• routing algorithm - determines where to send examples
• discreteness makes it difficult
• some works use RL to learn routing
• standard approach uses gumbel-softmax
• usually get matrix of similarities between input tokens and experts and route based on these
• sometimes route to topk experts rather than top1
• load balancing - usually add an auxiliary loss to encourage equal tokens being sent to different experts
• non-specialized experts
• routing notes - make hard decision but still want to learn probabilities
• straight-through estimator (STE) - take the argmax during the forward pass, while considering the original probabilities in the backward pass
• highly biased
• gumbel-softmax - allows for better sampling
• specialized experts as fully independent models (sometimes for multi-task learning)
• Towards Understanding Mixture of Experts in Deep Learning (chen…gu, li, 2022)
• ensembles (some of these are non-transformer papers)
• model soups (wortsman…schmidt, 2022) - average weights of finetuned models
• snapshot ensembles - average different checkpoints during training (huang et al. 2017)
• stochastic weight averaging (izmailov, …, wilson, 2019) - average multiple checkpoints during training
• batch ensemble (wen et al. 2020) - have several rank-1 keys that index different weights hidden within one neural net
• fit many models into one
• superposition of many models into one (cheung…olshausen, 2019) - both during training/testing models are indexed via a high-dim key for each task
• supermasks in superposition (wortsman, …, yosinski, farhadi, 2020) - randomly fixed base net + for each task finds subnet that achieves good performance
• if task identity not given, correct subnet inferred by minimizing output entropy
• Git Re-Basin: Merging Models modulo Permutation Symmetries (ainsworth, hayase, & srinivasa, 2022) - algo to merge models even when they haven’t been pretrained together
• early exit - popular way to speed up inference

• Multi-exit vision transformer for dynamic inference (Bakhtiarnia, A., Zhang, Q. and Iosifidis, A., 2021)

• early layers have large activation maps, so the early-exit classifier must be complex
• solution: ViT class token allows early-exit classifier to have constant complexity
• DeeBERT: Dynamic early exiting for accelerating BERT inference (xin…lin, 2020)
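Returning to the routing notes earlier in this section (similarity matrix between tokens and experts, top-k routing, auxiliary load-balancing loss), here is a minimal NumPy sketch. The loss form $N \sum_i f_i P_i$ is one common choice and is an assumption here, not something stated in these notes; the router is a single random linear layer and all sizes are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_route(tokens, router_weights, k=2):
    """Return chosen expert ids, renormalized gate values, and full router probabilities."""
    logits = tokens @ router_weights                 # (n_tokens, n_experts) similarities
    probs = softmax(logits)
    chosen = np.argsort(-probs, axis=-1)[:, :k]      # ids of the k most probable experts
    gates = np.take_along_axis(probs, chosen, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    return chosen, gates, probs

def load_balance_loss(chosen, probs):
    """Assumed form: n_experts * sum_i f_i * P_i (f_i: top-1 share, P_i: mean router prob)."""
    n_experts = probs.shape[1]
    f = np.bincount(chosen[:, 0], minlength=n_experts) / len(chosen)
    P = probs.mean(axis=0)
    return n_experts * float(f @ P)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))                    # made-up token representations
router_weights = rng.normal(size=(8, 4))             # made-up router for 4 experts
chosen, gates, probs = top_k_route(tokens, router_weights, k=2)
aux = load_balance_loss(chosen, probs)
```

The hard top-k choice is not differentiable; gradients reach the router only through `gates` and `probs`, which is where the straight-through and gumbel-softmax tricks in the notes come in.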

## connecting with rules

• Automatic Rule Extraction from Long Short Term Memory Networks (murdoch & szlam, 2017) - extract out phrases using feature importance
• A Comparative Study of Rule Extraction for Recurrent Neural Networks (wang et al. 2018) - create automata based on interpretable states to track RNNs

• Forecasting Future World Events with Neural Networks (zou…hendrycks, 2022) - takes tasks from metaculus

• forecasting paper titles (blog post)

• Neurosymbolic Programming for Science (sun…costilla-reyes, 2022)

• Learning from learning machines: a new generation of AI technology to meet the needs of science (berkeley+lbnl+, 2021)

• scientific organization (galactica)

• related but smaller models
• all data is processed in a common markdown format

• task-specific tokens to support different types of knowledge (e.g. citations, step-by-step reasoning, different modalities, e.g. proteins)
• chemical compounds (train on 2 mil / 110 mil from PubChem Compound, authors still want it to focus on text)
• predict IUPAC name from SMILES formula e.g. CC(C)(C)C(=O)N(CC1=NC(=CS1)C(=O)OC)C2CCCCC2 -> methyl 2-[[cyclohexyl-(2,2-dimethylpropanoyl)]amino] methyl]thiazole-4-

• moleculenet (wu et al. 2017) classification benchmark (6 tasks)

• training set examples are trained as text during fitting

• HIV - classify whether compound inhibits HIV replication
• BACE C - binding results (classification + regression) for BACE
• BBBP - blood-brain barrier penetration (permeability) (binary classification)
• Tox21 - qualitative toxicity on 12 targets (12-class multilabel binary)
• SIDER - 27-class multi-class disorders in different organ systems
• ClinTox - binary toxicity classification
• ex. for BBBP (one of the 6 tasks) - question is posed in different ways during training

Here is a SMILES formula:
[START_I_SMILES]O=C(O)CCCC1=CC=C(N(CCCl)CCCl)C=C1[END_I_SMILES]

Question: Will the chemical compound penetrate the blood-brain barrier?

• protein sequences
• from 227 million in UniProt, look at only 0.5 million subset (called Swiss-Prot)
• evaluate protein sequence perplexity
• protein keyword prediction (predict keywords in UniProt, like “ATP-Binding”, “Cell membrane”)
• protein function description - compare free-form description to GT UniProt function description

# basics

• attention = vector of importance weights
• to predict or infer one element, such as a pixel in an image or a word in a sentence, we use the attention vector to estimate how strongly it is correlated with (or “attends to”) other elements, and take the sum of their values weighted by the attention vector as the approximation of the target
• self-attention layer implementation, mathematics, and chandan’s self-attention cheat-sheet

## mathematical overview of transformers (Formal Algorithms for Transformers)

• sequence modeling: learn $p(x)$, usually factorized as $\prod_i p(x_i \mid x_1,…,x_{i-1})$
• sequence-to-sequence: learn $p(z \mid x)$, e.g. translation, speech-to-text, question answering
• preprocessing
• embedding matrix takes in one-hot tokens and linearly maps them to a vector
• positional embedding of a token is usually added to the token embedding to form a token’s initial embedding
• attention types
• Bidirectional / unmasked self-attention - primary/context vectors are the same
• Unidirectional / masked self-attention - mask scores for positions after a given word, so each word attends only to itself and earlier words
• Cross-attention - primary/context vectors can come from different places
• non-attention
• layernorm: controls mean/variance of activations
• RMSnorm: simpler version, sets mean/offset to zero
• unembedding
• linear layer (with softmax) that outputs size of original vocab
• sometimes fixed to be transpose of the embedding matrix
• predictions
• predict next word using single linear layer on hidden state from previous word
• finetune classification head often only using linear layer on first token from sequence
• architectures
• initially, encoder-decoder was common, but now models are often encoder-only (e.g. BERT) or decoder-only (e.g. GPT)
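The layernorm / RMSnorm contrast in the non-attention bullets above can be sketched in a few lines of NumPy. The `eps` value and the scale/offset parameter names (`gamma`, `beta`) are assumptions for this sketch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Controls both mean and variance of the activations, then applies scale and offset.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # Simpler: no mean subtraction and no offset, only rescaling by the root-mean-square.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([[1.0, 2.0, 3.0, 6.0]])       # made-up activations for one token
gamma, beta = np.ones(4), np.zeros(4)
ln = layer_norm(x, gamma, beta)
rn = rms_norm(x, gamma)
```

After `layer_norm` the activations have roughly zero mean and unit variance; after `rms_norm` the mean square is roughly one but the mean is left wherever it was.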

## visual explanation (notes on article by jay alammar)

• **self-attention** - layer that lets each word learn its relation to other words
• for each word, want score telling how much importance to place on each other word (queries $\cdot$ keys)
• we get an encoding for each word
• the encoding of each word returns a weighted sum of the values of the words (the current word gets the highest weight)
• softmax this and use it to do weighted sum of values
• (optional) implementation details
• multi-headed attention - just like having many filters, get many encodings for each word
• each one can take input as the embedding from the previous attention layer
• position vector - add this into the embedding of each word (so words know how far apart they are) - usually use sin/cos rather than actual position number
• residual + normalize - after self-attention layer, often have residual connection to previous input, which gets added then normalized
• decoder - each word only allowed to attend to previous positions
• 3 components
• queries
• keys
• values
• attention
• encoder reads input and outputs context vector after each word
• decoder at each step uses a different weighted combination of these context vectors
• specifically, at each step, decoder concatenates its hidden state w/ the attention vector (the weighted combination of the context vectors)
• this is fed to a feedforward net to output a word
• at a high level we have $Q, K, V$ and compute $\text{softmax}(QK^T)V$
• instead could simplify it and do $\text{softmax}(XX^T)V$ - this would then be based on kernel
• transformer
• uses many self-attention layers
• many stacked layers in encoder + decoder (not rnn: self-attention + feed forward)
• details
• initial encoding: each word -> vector
• each layer takes a list of fixed size (hyperparameter e.g. length of longest sentence) and outputs a list of that same fixed size (so one output for each word)
• can easily train with a masked word to predict the word at the predicted position in the encoding
• multi-headed attention has several of each of these (then just concat them)
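The $\text{softmax}(QK^T)V$ computation from the notes above, as a single-head NumPy sketch with an optional causal mask for the decoder case. Assumptions: rows of `X` are tokens, the weights are random, and scores are divided by $\sqrt{d_k}$ as in the original transformer paper (the notes omit this scaling).

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V, causal=False):
    """Single-head self-attention; returns (outputs, attention weights)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # queries · keys, scaled (assumption)
    if causal:
        # Block attention to later positions: entries above the diagonal become -inf.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(0)
n_words, d_embed, d_attn = 4, 6, 3                   # made-up sizes
X = rng.normal(size=(n_words, d_embed))
W_Q, W_K, W_V = (rng.normal(size=(d_embed, d_attn)) for _ in range(3))

out, w = self_attention(X, W_Q, W_K, W_V)
out_c, w_c = self_attention(X, W_Q, W_K, W_V, causal=True)
```

Multi-headed attention would run several copies of this with separate weight matrices and concatenate the outputs.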

## huggingface tutorial

Broadly, models can be grouped into three categories:

• GPT-like (also called auto-regressive Transformer models)
• BERT-like (also called auto-encoding Transformer models)
• BART/T5-like (also called sequence-to-sequence Transformer models)
• Handling multiple sequences - Hugging Face Course
• pad sequences to have the same length (need to modify attention masks to ignore the padded values)
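A minimal sketch of the padding step described above, without using the Hugging Face library itself. The names `input_ids` and `attention_mask` mirror the usual tokenizer output keys; `pad_batch` and `pad_id=0` are hypothetical choices for this example.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length and build the matching attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        input_ids[i, :len(seq)] = seq
        attention_mask[i, :len(seq)] = 1     # 1 = real token, 0 = padding to ignore
    return input_ids, attention_mask

# Made-up token ids for two sequences of different length.
input_ids, attention_mask = pad_batch([[5, 6, 7], [8, 9]])
```

The mask is what lets the model ignore the padded positions, so that a padded sequence gives the same result as the unpadded one.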

### pre-transformer nlp models

• rnns
• when training rnn, accumulate gradients over sequence and then update all at once
• stacked rnns have outputs of rnns feed into another rnn
• bidirectional rnn - one rnn left to right and another right to left (can concatenate, add, etc.)
• standard seq2seq
• encoder reads input and outputs context vector (the hidden state)
• decoder (rnn) takes this context vector and generates a sequence
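The standard seq2seq setup above (encoder compresses the input into one context vector, decoder unrolls from it) as a toy NumPy sketch. Weights are random and untrained, and the decoder is simplified: it does not feed its previous output back in, which a real decoder would.

```python
import numpy as np

def rnn_encode(inputs, W_xh, W_hh, h0):
    """Run a vanilla RNN over the inputs; the final hidden state is the context vector."""
    h = h0
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

def rnn_decode(context, W_hh, W_hy, n_steps):
    """Unroll from the context vector, emitting the argmax token id at each step."""
    h, outputs = context, []
    for _ in range(n_steps):
        h = np.tanh(W_hh @ h)                # simplification: no input from previous output
        outputs.append(int(np.argmax(W_hy @ h)))
    return outputs

rng = np.random.default_rng(0)
d_in, d_hidden, vocab = 3, 5, 7              # made-up sizes
inputs = rng.normal(size=(4, d_in))          # a 4-step input sequence
enc_W_xh = rng.normal(size=(d_hidden, d_in))
enc_W_hh = rng.normal(size=(d_hidden, d_hidden))
dec_W_hh = rng.normal(size=(d_hidden, d_hidden))
dec_W_hy = rng.normal(size=(vocab, d_hidden))

context = rnn_encode(inputs, enc_W_xh, enc_W_hh, np.zeros(d_hidden))
tokens = rnn_decode(context, dec_W_hh, dec_W_hy, n_steps=4)
```

Everything the decoder knows about the input has to pass through the single `context` vector, which is the bottleneck that attention was introduced to relieve.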