transformers

Broad-ranging notes on papers involving transformers. Biased towards things I find cool - neuroscience, trees, and automatic science.


See related papers in the 📌 interpretability page.


nlp (see also this link)

  • early papers
    • attention is all you need (vaswani et al. 2017) - initial transformer
      • encoder-decoder transformer for seq-to-seq (most new models don’t have special encoder-decoder structure for translation)
      • Semi-supervised Sequence Learning (dai & quoc le, 2015)
        • context vector is weighted sum of context vector at each word
    • ULMFiT (howard & ruder, 2018)
  • BERT (devlin et al. 2018) - semi-supervised learning (predict masked word - this is bidirectional) + supervised finetuning
    • roberta (liu et al. 2019)
    • BART (lewis et al. 2019) - generalizes BERT with sequence-to-sequence training: train by (1) corrupting text, then (2) reconstructing the original text
    • ELMo (peters…zettlemoyer, 2018) - no word embeddings - train embeddings w/ bidirectional lstm (on language modeling)
    • XLNet (yang…quoc le, 2020)
  • GPT-4 (openai, 2023) - adds multimodal understanding + boosts context length to 32k
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (clark…quoc le, chris manning, 2020)

    • more efficient: rather than standard masked training, use generator-discriminator setup for “replaced token detection”
      • generator replaces many masked tokens with plausible samples - train with MLM
      • discriminator tries to guess which tokens were replaced by the generator - this is the main model that gets used
  • LongNet: Scaling Transformers to 1,000,000,000 Tokens (ding, …, wei, 2023) - multiscale attention similar to wavelets
  • PaLM: Scaling Language Modeling with Pathways (Google 2022) - 540 Billion params
    • pathways hardware center allows for fast/efficient training
    • discontinuous improvements - at some point large model improves
    • prompt engineering: “Explain yourself” - lets it explain jokes
    • Chinchilla: Training Compute-Optimal Large Language Models (DeepMind 2022)
      • “chinchilla scaling laws” - for compute-optimal training, the model size and the number of training tokens should be scaled equally
  • T0 (sanh…rush, 2022) - multitask training enables better zero-shot generalization
  • early instruction following
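A rough illustration of the Chinchilla note above (scale model size and training tokens equally). This sketch assumes the common approximation of about 6·N·D training FLOPs for N parameters and D tokens, plus a fixed ratio of 20 tokens per parameter; both are rules of thumb rather than values taken from these notes, and `compute_optimal` is a hypothetical helper.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens).

    Assumes C ~= 6 * N * D and D = tokens_per_param * N (both rules of thumb).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n1, d1 = compute_optimal(1e21)
n2, d2 = compute_optimal(4e21)  # 4x the compute
# Under these assumptions, params and tokens grow by the same factor.
print(n2 / n1, d2 / d1)
```

With a fixed tokens-per-parameter ratio, quadrupling compute doubles both quantities, which is the "scaled equally" behavior in the note.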

  • subquadratic attention
  • smaller newer models
    • phi-1, phi-2
    • mistral 7B, mixtral MoE



  • Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing (liu…neubig, 2021)
    • from feature-engineering -> architecture engineering -> prompt engineering
    • prompting_typology
  • early prompting papers
    • LAMA Language Models as Knowledge Bases? (petroni…riedel, 2019) - Proposes using fill-in-the-blank (cloze) prompts for extracting knowledge from large language models
      • create LAMA probe - dataset of (subject, relation, object) triplets with templates – find that BERT can recall these relations
      • How to Query Language Models? (adolphs et al. 2021) - query LLMs by example (e.g. “Ronaldo plays for Portugal. Who does Neuer play for?”)
      • How Can We Know What Language Models Know? (jiang … neubig, 2020)
        • mining-based and paraphrasing-based methods to automatically generate high-quality diverse prompts
        • ensemble methods to combine answers from different prompts (e.g. avg logits and more)
      • Noisy Channel Language Model Prompting for Few-Shot Text Classification (min et al. 2022)
        • querying $P(\text{question} \mid \text{answer})$ with Bayes rule outperforms standard querying $P(\text{answer} \mid \text{question})$
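A minimal sketch of the direct vs. noisy-channel scoring contrast from the last entry above. The log-probabilities are made-up numbers standing in for LM scores, and `direct_predict` / `channel_predict` are hypothetical helper names, not the paper's code.

```python
import math

def direct_predict(logp_label_given_text):
    """Pick the label maximizing log P(label | text)."""
    return max(logp_label_given_text, key=logp_label_given_text.get)

def channel_predict(logp_text_given_label, log_prior=None):
    """Pick the label maximizing log P(text | label) + log P(label)."""
    labels = list(logp_text_given_label)
    if log_prior is None:  # uniform prior over labels
        log_prior = {y: -math.log(len(labels)) for y in labels}
    return max(labels, key=lambda y: logp_text_given_label[y] + log_prior[y])

# Made-up scores for one input; a real system would obtain them from an LM.
direct_scores = {"positive": -0.9, "negative": -0.5}
channel_scores = {"positive": -12.0, "negative": -15.5}
print(direct_predict(direct_scores), channel_predict(channel_scores))
```

With exact probabilities the two rules agree by Bayes rule; they can differ in practice because the LM's estimates of the two conditionals are not mutually consistent.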



  • natural-language prompting
    • iPrompt: Explaining Patterns in Data with Language Models via Interpretable Autoprompting (singh, morris, …gao, 2022)
    • APE: Large Language Models Are Human-Level Prompt Engineers (zhou…ba, 2022)
      • similar to iPrompt, (1) propose prompt candidates with an LLM, (2) score the prompts by the accuracy they yield when using another LLM and (3) regenerate similar prompt candidates
      • experiments on instruction induction datasets + truthful QA
    • FluentPrompt: Toward Human Readable Prompt Tuning (shi, …, zettlemoyer, 2022) - use langevin sampling + fluency constraint to generate prompt
      • experiments relatively weak: 3 sentiment datasets + autoprompt is the only baseline
    • APO: Automatic Prompt Optimization with “Gradient Descent” and Beam Search (pryzant…zeng, 2023) - update prompts based on errors made by previous prompts
    • OPRO: Large Language Models as Optimizers (yang…quoc le, zhou, & chen , 2023) - add in past prompts with their scores during optimization
    • Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (fernando…rocktaschel, 2023) - simultaneously improve prompts with LLM + improve the mutation-prompts the LLM uses to mutate the prompts
    • Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers (guo…yang, 2023)
    • PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization (wang…hu, 2023) - iterate on prompt errors using MC tree search
    • Language Models as Black-Box Optimizers for Vision-Language Models (yu…pathak, & ramanan, 2023)
  • discrete prompting
  • continuous prompt optimization
    • Prefix-Tuning: Optimizing Continuous Prompts for Generation (li & percy liang, 2021) – optimizes in continuous space for language generation tasks
      • learn to map some parameters $\theta$ through an MLP to generate a starting hidden state $h_i$ – never actually sends the prefix through the network
    • P-Tuning: GPT Understands, Too (liu et al. 2021) – use LSTM to generate prompt embeddings (don’t map to tokens)
    • Control Prefixes for Parameter-Efficient Text Generation (clive, cao, & rei, 2022) - allow for adapting the prefix to each input example
      • DART: Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners (zhang…chen, 2022)
        • reformulating NLP task into differentially optimizing the prompt template + target label (given a pre-trained model)
        • focus on smaller models (Roberta-large + GPT-2) + few training shots
        • fluency constraint to ensure association among prompt embeddings
    • WARP: Word-level Adversarial ReProgramming (Hambardzumyan et al. 2021) - add continuous tokens + some task-specific parameters for better generalization
    • KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction (Chen et al. 2021) – incorporate relations, visualize learned prompt vectors with t-SNE
  • critiques of prompting
    • Do Prompt-Based Models Really Understand the Meaning of their Prompts? (webson & pavlick, 2022) - models can learn fine with prompts that are intentionally irrelevant
    • Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity (lu…riedel, stenetorp, 2021)
    • Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting (sclar, choi…, suhr, 2023)
  • misc
    • Context-faithful Prompting for Large Language Models (zhou, shang, poon & chen, 2023) - ask question in clever way to force LLM to follow it
    • SentiPrompt: Sentiment Knowledge Enhanced Prompt-Tuning for Aspect-Based Sentiment Analysis (Zhang et al. 2021) - use sentiment knowledge penalties in the prompt
    • Meta-learning via Language Model In-context Tuning (Chen et al. 2022) - given new task with new instruction
    • Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm (Reynolds & McDonell, 2021) - define metaprompts as general wrappers around tasks e.g. “This problem asks us to”
    • Re3: Generating Longer Stories With Recursive Reprompting and Revision (Yang et al. 2022) - generate summaries, then expand and revise with prompts
    • Directional Stimulus Prompting (li, baoling peng, …jianfeng gao, xifeng yan, 2023) - generate hint keywords using small LLM that are put into the prompt when calling large LLM
    • memory-assisted prompt-editing (madaan…yang, 2022) - allows model to “save things to memory” that get added to prompt when needed
    • Prompting Is Programming: A Query Language For Large Language Models (Beurer-Kellner, Fischer, & Vechev, 2022)
  • can benefit from training for promptability
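The propose / score / regenerate loop described for APE and iPrompt in the natural-language prompting entries above can be sketched as follows. `propose` and `score` are hypothetical stand-ins for LLM calls (here replaced by toy functions so the loop runs without a model); this is an illustration of the loop shape, not either paper's implementation.

```python
def search_prompts(propose, score, seed_prompts, n_rounds=3, keep=2):
    """Propose-score-regenerate loop over natural-language prompts.

    propose(prompt) -> list of new candidate prompts (hypothetical LLM call)
    score(prompt)   -> float, e.g. accuracy on a small dev set (hypothetical)
    """
    pool = list(seed_prompts)
    for _ in range(n_rounds):
        ranked = sorted(set(pool), key=score, reverse=True)
        best = ranked[:keep]
        # Regenerate variations of the current best prompts, keep the parents.
        pool = best + [c for p in best for c in propose(p)]
    return max(set(pool), key=score)

# Toy stand-ins so the loop runs without any model.
def toy_propose(prompt):
    return [prompt + " step by step", prompt + " briefly"]

def toy_score(prompt):
    return prompt.count("step by step") - 0.1 * prompt.count("briefly")

best_prompt = search_prompts(toy_propose, toy_score, ["Answer the question"])
print(best_prompt)
```

In a real setup `score` would be the expensive part (one LLM evaluation pass per candidate), so `keep` and `n_rounds` control the query budget.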

llm chaining / decoding

many notes are from this thread on chaining models together

  • overviews
    • Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts (wu, terry, & cai, 2022) - chaining LLM steps together: output of one step becomes the input for the next
      • interactive system where users can modify chains + their intermediate results – improves performance + human experience
    • Language Model Cascades (dohan…sutton, 2022) - treat chaining models as probabilistic programs
      • use a probabilistic-programming language (PPL) to define a joint probability model on string-valued random variables, parameterized using LMs, and then condition this model on string-valued observations in order to compute a posterior over string-valued unknowns
      • self-PPLs extend probabilistic graphical models to support more complex joint distributions whose size and “shape” can itself be stochastic
        • e.g., a graph unrolled for a random number of iterations, until a data-dependent stopping criterion is met
        • variables are all text: questions $Q$, answers $A$, and intermediate thoughts $T$
  • posthoc
    • understanding chain-of-thought and its faithfulness
      • Faithful Chain-of-Thought Reasoning (lyu et al. 2023)
      • Contrastive Chain-of-Thought Prompting (chia…bing, 2023)
      • Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks (chen et al. 2022)
      • Critiques
        • Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations (chen, zhong, …, steinhardt, yu, mckeown, 2023)
        • The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning (ye & durrett, 2022)
        • Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (turpin, …, bowman, 2023)
          • CoT explanations can be heavily influenced by biasing the model towards certain answers, thereby yielding invalid explanations
          • try biasing in 2 ways: answer is always (A), or setting where prompt suggests a certain answer
        • Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs (chen, …, bowman, cho, 2023) - models fail at these 2 tasks:
          • hypothetical consistency (the ability for a model to predict what its output would be in a hypothetical other context)
          • compositional consistency (consistency of a model’s outputs for a compositional task even when an intermediate step is replaced with the model’s output for that step)
      • faithfulness metric = model sensitivity to removing some of the explanation
        • Question Decomposition Improves the Faithfulness of Model-Generated Reasoning (anthropic, 2023) - introduce factored decomposition to improve faithfulness metric
        • Measuring Faithfulness in Chain-of-Thought Reasoning (anthropic, 2023) - in addition to just removing some of the explanation, also add mistakes to it / paraphrase it
          • larger models become less faithful by this metric
        • Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI (sia…zettlemoyer, mathias, 2023)
      • Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning (chen…gao, 2024)
      • Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals (elazar…sameer singh, noah smith, 2023)
      • Faithful Explanations of Black-box NLP Models Using LLM-generated Counterfactuals (gat…reichart, 2023)
      • Counterfactually Aware Fair Text Generation (banerjee…bhatia, 2023)
      • Causal Proxy Models for Concept-based Model Explanations (wu…potts, 2023)
      • Evaluating Models’ Local Decision Boundaries via Contrast Sets (gardner…zhou, 2020)
      • Are Large Language Models Post Hoc Explainers? (kroeger…lakkaraju, 2023)
    • Followups to Chain of Thought Prompting (wei et al. 2022)
      • in few-shot prompts, don’t just provide answer but also reasoning
      • model output then provides reasoning + answer
      • Self-Discover: Large Language Models Self-Compose Reasoning Structures (zhou…le…zheng, 2024) - LLMs come up with their own step-by-step structure for a task
      • Self-Consistency Improves Chain of Thought Reasoning in Language Models (wang, wei, schuurmans, quoc le, … zhou, 2022) - use output samples rather than greedy and return the most consistent final answer in the set
      • Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (suzgun, …, quoc le, …, jason wei, 2022)
      • self-ask (Press et al., 2022) - LLM asks itself (and then answers) follow-up questions before answering the initial question
      • Text Classification via Large Language Models (sun…wang, 2023) - add clues to the prompt
      • Let’s Do a Thought Experiment: Using Counterfactuals to Improve Moral Reasoning (ma, …, chen, 2023) - counterfactuals help improve CoT
      • RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought (xue et al. 2023)
      • SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning (miao, teh, & rainforth, 2023)
      • EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning (mekala…sameer singh, 2023) - replace let’s think step by step with Let’s repeat the question and also think step by step
    • scratchpads Show Your Work: Scratchpads for Intermediate Computation with Language Models (nye et al. 2021)
    • selection inference (creswell et al. 2022) - generate set of facts, then iteratively generate inferences from the facts to yield the final answer
    • least-to-most prompting (zhou…quoc le et al. 2022) - prompt LLM with context showing how to reduce into subproblems; then LLM sequentially solves the subproblems, using the previous answers
    • Generated Knowledge Prompting for Commonsense Reasoning (liu…hajishirzi, 2021) - generate knowledge from an LLM then provide it as additional input when answering a question
    • maieutic prompting (jung et al. 2022) - generate a tree of all explanations of the form “True, because…”, “False, because…” then query LLM with these as prompts
      • then use Max-SAT to try to satisfy as many relations between the model explanations as possible to come up with the true answer
    • LM vs LM: Detecting Factual Errors via Cross Examination (cohen et al. 2023)
    • decoding
    • fast decoding
      • KV caching + some other tricks - if repeatedly using the same tokens at the beginning of the context, can cache the KV vectors for those tokens
        • KV caching trades off speed with memory
      • speculative decoding (leviathan, kalman, & matias, 2022) - draft multiple tokens with a small model, then verify them in parallel with the large model, potentially skipping steps for the large model
    • early exit - popular way to speed up inference
      • Multi-exit vision transformer for dynamic inference (Bakhtiarnia, A., Zhang, Q. and Iosifidis, A., 2021)
        • early layers have large activation map so early-exit classifier must be complex
        • solution: ViT class token allows early-exit classifier to have constant complexity
      • DeeBERT: Dynamic early exiting for accelerating BERT inference (xin…lin, 2020)
  • prompt ensembles
    • liu…neubig, 2023 review discusses different strategies for ensembling prompts, e.g. averaging, weighted averaging
    • black-box querying
      • Tree-Prompting (morris…deng, 2023)
      • PromptBoosting: Black-Box Text Classification with Ten Forward Passes (hou, …, jacob andreas, …, zhang, 2022) - get a small pool of prompts, learn a verbalizer (final classification layer) for each, then ensemble them with AdaBoost on LLM output
      • many works have studied prompt ensembling (e.g. lester et al. 2021)
      • Boosted Prompt Ensembles for Large Language Models (pitis…ba, 2023) - similar but use CoT-style prompts and tasks, e.g. GSM8k
      • PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine (zhang…cai, 2023) - builds set of prompts dynamically rather than assuming they’re fixed
      • PTR: Prompt Tuning with Rules for Text Classification (han et al. 2021) – use logic rules to construct prompts with sub-prompts for many-class text classification (prompt is constructed hierarchically, but only one call is made to the LLM for inference)
    • soft prompts
      • Learning How to Ask: Querying LMs with Mixtures of Soft Prompts (Qin & Eisner, 2021) - learn a mixture of soft prompts using gradient descent
    • require model retraining
      • PRBOOST: Prompt-Based Rule Discovery and Boosting for Interactive Weakly-Supervised Learning (zhang…zhang, 2022) - iteratively (1) select high-error examples, (2) have human label them as rules, and (3) use boosting to train model on the new rules + ensemble
      • typical rule generation
        • Snuba (Varma and Ré, 2018) generates heuristics based on a small labeled dataset with pre-defined rule types
        • TALLOR (Li et al. 2021a) & GLaRA (Zhao et al. 2021) study rule expansion for NER problem based on lexical information and then select rules based on a hand-tuned threshold
    • Prompt ensembling / selection without labels
      • Zero-Label Prompt Selection (liao, zheng, & yang, 2022) - use prompts to label unlabeled data and then select prompts using these labels
      • A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models (allingham…lakshminarayanan, 2023) - use confidence (max output logit) after appropriate normalization as weight
  • self-verification
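Going back to the self-consistency entry in the chain-of-thought follow-ups above (sample several outputs and return the most consistent final answer), a minimal sketch of the voting step. `sample_fn` is a hypothetical stand-in for one sampled LLM generation; the canned outputs below replace a real model.

```python
from collections import Counter

def self_consistent_answer(sample_fn, question, n_samples=5):
    """Sample several reasoning chains and return the most common final answer.

    sample_fn(question) -> (reasoning_text, final_answer); hypothetical stand-in
    for one sampled (non-greedy) LLM generation.
    """
    answers = [sample_fn(question)[1] for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples

# Toy sampler that cycles through canned outputs instead of calling a model.
_canned = iter([("3+4=7", "7"), ("3+4=8", "8"), ("4+3=7", "7"),
                ("3+4 is 7", "7"), ("guess", "6")])
answer, agreement = self_consistent_answer(lambda q: next(_canned), "What is 3+4?")
print(answer, agreement)
```

The agreement fraction is returned as well since it doubles as a crude consistency-based confidence signal.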

llm querying / causal inference

  • Can Large Language Models Infer Causation from Correlation? (jin…scholkopf, 2023) - introduce Corr2Cause dataset (must infer causal graph from correlational statements), doesn’t test pre-existing knowledge
  • Causal Reasoning and Large Language Models: Opening a New Frontier for Causality (kiciman…tan, 2023)
    • LLMs to be used alongside existing causal methods, as a proxy for human domain knowledge and to reduce human effort in setting up a causal analysis
      • cause-effect pairs, LLM has to discover from graph (tubingen benchmark, neuropathic pain, etc.)
  • Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond (feder…veitch, diyi yang, 2022)
  • Zero-shot causal learning (nilforoshan…leskovec, 2023)
  • Discovering Latent Knowledge in Language Models Without Supervision (burns, ye, klein, & steinhardt, 2022) - identify whether text is true or false directly from a model’s unlabeled activations
  • InferBERT: A Transformer-Based Causal Inference Framework for Enhancing Pharmacovigilance (wang…liu, 2021) - learn + test feature relationships from attention weights
  • CausaLM: Causal Model Explanation Through Counterfactual Language Models (2021) - produce example-level causal model explanations using models finetuned on auxiliary adversarial tasks derived from the causal graph of the problem
  • Investigating Gender Bias in Language Models Using Causal Mediation Analysis (vig, …, shieber, 2020)
    • Applies causal mediation analysis to identify decisive neurons and attention heads responsible for gender bias in large language models
    • Identifies a small handful of decisive attention heads in this case
  • Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals (elazar, …, goldberg, 2021) - measure the importance of specific info within a model by introducing a causal intervention to erase that information, then observing the causal effects
  • TrustLLM (sun…zhao, 2024) - evaluation and benchmark of many aspects of trustworthiness (github)
  • Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability (akyürek…andreas, 2024) - LMs generate additional text implied by documents, reason about the generated text, and finetune on the correct text
    • LMs’ reasoning capabilities during inference can be leveraged during training to improve their reliability
  • uncertainty
    • Semantic Uncertainty (kuhn, gal, & farquhar, 2023) - instead of calculating entropy over tokens, first generate set of answers, then cluster them based on semantic equivalence, before computing entropy
      • clustering is done via an LM that tests entailment, e.g. “The capital of France is Paris.” entails “Paris is the capital of France.” because they mean the same thing
    • Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs (xiong…hooi, 2023)
      • verbalized uncertainty - model outputs its own uncertainty
      • consistency-based uncertainty - consistency between output generations
    • Quantifying Uncertainty in Natural Language Explanations of Large Language Models (tanneru…lakkaraju, 2023)
      • probing uncertainty (like consistency-based uncertainty above) - applies input perturbations (e.g., paraphrasing) and measures the consistency of the resulting explanations
      • verbalized uncertainty of explanations often performs poorly
    • Relying on the Unreliable: The Impact of Language Models’ Reluctance to Express Uncertainty (zhou…sap, 2024)
      • LMs are often unable to express uncertainties
      • LMs tend to be overconfident
      • users rely heavily on LM generations, whether or not they are marked by certainty
    • Teaching Models to Express Their Uncertainty in Words (Lin et al., 2022) - GPT3 can generate both an answer and a level of confidence (e.g. “90% confidence”)
    • Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling (hou…zhang, 2023)
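A small sketch of the semantic-uncertainty idea from the first entry in this list: cluster sampled answers by meaning, then take entropy over clusters. Two simplifications are assumed here: cluster probabilities are plain sample frequencies, and `toy_equivalent` is a bag-of-words check standing in for the LM entailment test (it is not a real entailment model).

```python
import math

def semantic_entropy(answers, equivalent):
    """Cluster sampled answers by meaning, then compute entropy over clusters.

    equivalent(a, b) -> bool; hypothetical stand-in for a bidirectional
    entailment check done with an LM.
    """
    clusters = []  # each cluster is a list of answers judged equivalent
    for a in answers:
        for c in clusters:
            if equivalent(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log(p) for p in probs), clusters

# Toy equivalence: same set of lowercase words (NOT a real entailment model).
def toy_equivalent(a, b):
    norm = lambda s: frozenset(s.lower().strip(".").split())
    return norm(a) == norm(b)

samples = ["The capital of France is Paris.",
           "Paris is the capital of France.",
           "The capital of France is Lyon.",
           "Paris is the capital of France."]
entropy, clusters = semantic_entropy(samples, toy_equivalent)
print(round(entropy, 3), len(clusters))
```

Token-level entropy would treat the first two samples as different strings; clustering first means only the Paris/Lyon disagreement contributes to the uncertainty.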


retrieval-augmented-generation (RAG) + tool use

  • private
    • - nice demo adding citation to each fact
    • langchain library
    • - provide tools for wrapping APIs in LLM + interaction through router (also default modules for stateful storage, user identity, etc.)
  • Augmented Language Models: a Survey (meta, 2023) - 3 categories: reasoning, tools, action
  • Infer–Retrieve–Rank: In-Context Learning for Extreme Multi-Label Classification (D’Oosterlinck, …, potts, 2024)
    1. Infer: an LM processes the input document and guesses a set of applicable terms
    2. Retrieve: a retriever relates each predicted term to the actual label space
    3. Rank: Finally, an LM is used to rerank retrieved labels
  • Toolformer: Language Models Can Teach Themselves to Use Tools (meta, 2023) - model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction
    • Given input, sample position and API call candidates, try them all, and filter out ones which do not reduce next-token loss
      • put correct API calls into prompt, e.g. Pittsburgh is also known as [QA(What ...?→ Steel City)] the Steel City.
    • Training
      • start with few human-written examples of API use
      • LLM generates more uses
      • self-supervised loss determines which calls help with future-token prediction
    • Atlas: Few-shot Learning with Retrieval Augmented Language Models (meta, 2022)
  • Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (asai…hajishirzi, 2023) - train an LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens
  • Active RAG (jiang…neubig, 2023) - propose FLARE, which iteratively uses a prediction of the upcoming sentence to anticipate future content, which is then utilized as a query to retrieve relevant documents to regenerate the sentence if it contains low-confidence tokens
  • retrieval-augmented in-context learning (put retrieved info into context, or something very similar)
    • REALM (guu, …, chang, 2020) - retrieves document chunks from corpus and adds them to context, for open-domain QA
    • RETRO (deepmind, 2022) - nearest neighbors to model’s input are retrieved, encoded, and conditioned on with chunked cross-attention
    • Decomposed prompting (khot et al., 2022) - decompose tasks via prompting which are delegated to a shared library of prompting-based LLMs dedicated to these sub-tasks
    • LLM-Augmenter (peng, galley…gao, 2023) - (1) consolidates evidence from external knowledge for the LLM to generate responses grounded in evidence, and (2) revises LLM’s (candidate) responses using automated feedback
    • Knowledgeable Prompt-tuning (Hu et al. 2021) - add knowledge-base info into the prompt search
  • memorizing transformers (wu…szegedy, 2022) - knn-based learned indexing + retrieval at training time
    • at test time, you just need to index the entire context and the model will be able to use it
    • kNN Prompting: Learning Beyond the Context with Nearest Neighbor Inference (xu…zhang, 2023) - instead of verbalizer, use nearest-neighbor
      • has dbpedia results
    • kNN-Prompt: Nearest Neighbor Zero-Shot Inference (shi…zettlemoyer, 2022)
  • original
  • webgpt (nakano, …, schulman, 2022, OpenAI) - allows google search to add world info

adaptation / transfer

These are transformer-specific. For more general notes, see 📌 transfer learning or 📌 uncertainty. Most of these approaches can be combined with metalearning.

mt-dnn line of work

  • Multi-Task Deep Neural Networks for Natural Language Understanding (xiaodong liu … gao 2019) - multi-task learning on the 9 glue tasks (first layers are shared, then some task-specific layers at top)
  • RAdam: On the Variance of the Adaptive Learning Rate and Beyond (liyuan liu…gao, han, 2020)
    • usually need to do learning-rate warmup when training (e.g. with Adam)
    • RAdam = add a term to rectify the variance of the adaptive learning rate in Adam
  • SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization (jiang…gao, zhao, 2020)
    1. Smoothness-inducing regularization, which effectively manages the complexity of the model
    2. Bregman proximal point optimization to prevent aggressive updating
  • Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding (xiaodong liu…gao, 2020)
  • Posterior Differential Regularization with f-divergence for Improving Model Robustness (hao cheng, …, gao 2021)
    • regularize model posterior difference between clean + noisy inputs (e.g. adversarially attacked inputs)

comparing different tasks

  • Task2Vec: Task Embedding for Meta-Learning (achille, …, soatto, perona, 2019) - summarize each task as a vector, by taking diagonal of fisher info matrix (derivative of network output wrt parameters) - clusters similar tasks
  • Efficiently Tuned Parameters are Task Embeddings (zhou…mcauley, 2022)
  • Editing Models with Task Arithmetic (ilharco, ribeiro, …, farhadi, 2022) - task vector = model weights after task finetuning minus model weights before finetuning
    • can use this direction to alter model behavior
  • Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation (vu…constant, 2022) - train with prompts of some (language translation, task) pairs and show that they can generalize to new (language, task) pairs
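The task-arithmetic entry above (task vector = finetuned weights minus pretrained weights, then move along that direction) can be sketched with plain lists. The parameter names and values are made up for illustration, and the helpers are hypothetical.

```python
def task_vector(pretrained, finetuned):
    """Element-wise (finetuned - pretrained) for every named parameter."""
    return {name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
            for name in pretrained}

def apply_task_vector(pretrained, vector, scale=1.0):
    """Move the pretrained weights along the task direction; scale < 0 negates the task."""
    return {name: [p + scale * v for p, v in zip(pretrained[name], vector[name])]
            for name in pretrained}

# Toy "models": flat lists of floats keyed by (hypothetical) parameter names.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
tuned = {"layer.weight": [1.5, 1.0], "layer.bias": [0.25]}

tau = task_vector(base, tuned)
restored = apply_task_vector(base, tau, scale=1.0)   # recovers the finetuned weights
negated = apply_task_vector(base, tau, scale=-1.0)   # steps away from the task
print(tau, negated)
```

Vectors from several tasks can be summed before applying, which is where this connects to the model-merging methods below.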

instruction tuning / rlhf

model merging

Model merging (some of these are non-transformer papers) = combine different models that have the same architecture (see collection of papers here and huggingface blog post here). Also see the review paper Deep Model Fusion: A Survey (li…shen, 2023)

  • standard methods (see mergekit package)
    1. linear averaging, e.g. model soups (wortsman…schmidt, 2021)
    2. spherical linear interpolation - interpolate angle but keep norm constant
    3. TIES: Resolving Interference When Merging Models (yadav…raffel, bansal, 2023)
      1. only keep top-k% most significant changes in weights
      2. vote on signs of parameters
    4. DARE (yu…li 2023)
      1. randomly reset $p$ fraction of changed fine-tuned weights to their original values in the base model
      2. rescale remaining changed weights by $1/(1-p)$
    5. passthrough/frankenmerging
      1. stack layers to yield model with different size
      2. e.g. depth up-scaling creates a larger model by merging some layers and copying others (solar 10.7B, kim…kim, 2023)
  • more complex posthoc methods
    • Learning to Route Among Specialized Experts for Zero-Shot Generalization (muqeeth, …, raffel, 2024) - PHATGOOSE routes to different LoRA model for each token and at each layer
    • Fisher-Weighted Averaging (matena & raffel, 2022) - merge models with same architecture with particular weights
    • Git Re-Basin: Merging Models modulo Permutation Symmetries (ainsworth, hayase, & srinivasa, 2022) - permute units of one model to align them with a reference model before merging; supports linear mode connectivity between ResNet models on CIFAR
      • ZipIt! Merging Models from Different Tasks without Training (stoica…hoffman, 2023) - layerwise merging & don’t merge all the layers
    • Model Merging by Uncertainty-Based Gradient Matching (daheim…khan, 2023)
    • UnIVAL: multimodal merging (shukor…cord, 2023)
    • LoraHub (huang…lin, 2023) - given examples from a new task, merge LoRA adaptors
    • AdaMerging: Adaptive Model Merging for Multi-Task Learning (yang…tao, 2023) - learn coefficients to average models by minimizing entropy on unlabeled test samples
    • Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization (rame…bottou, lopez-paz, 2022) - finetune many models initially trained on diverse tasks then average their weights
  • training paradigms
    • Branch-Train-Merge: ELMS (Expert LMs) (li…smith, zettlemoyer 2022)
      • parallel language model of smaller expert LMs
      • each can be added/removed, ensembled, or parameter-averaged at any time for efficient scaling and rapid customization
      • improves perplexities, when controlling for training cost
        • require expert domain specialization
      • Cluster-Branch-Train-Merge (gururangan…smith, zettlemoyer, 2023) - start by clustering data to do unsupervised domain discovery
  • fit many models into one
    • superposition of many models into one (cheung…olshausen, 2019) - both during training/testing models are indexed via a high-dim key for each task
    • supermasks in superposition (wortsman, …, yosinski, farhadi, 2020) - randomly fixed base net + for each task finds subnet that performs well
      • if task identity not given, correct subnet inferred by minimizing output entropy
  • non-transformer
    • snapshot ensembles - average different checkpoints during training (huang et al. 2017)
    • stochastic weight averaging (izmailov, …, wilson, 2019) - average multiple checkpoints during training
    • batch ensemble (wen et al. 2020) - have several rank-1 keys that index different weights hidden within one neural net
    • data-based distillation for model merging (roth…akata, 2024) - can combine multiple models that excel at different classes using data-based distillation
    • Model Fusion via Optimal Transport (singh & jaggi, 2019) - layer-wise fusion algorithm using optimal transport
    • Qualitatively characterizing neural network optimization problems (goodfellow, vinyals, & saxe, 2014) - linear interpolation experiments on DNNs
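A sketch of the DARE step from the standard-methods list above (reset a fraction $p$ of changed weights to the base value, rescale the rest by $1/(1-p)$). One assumption to flag: each weight is dropped independently with probability `p`, so the dropped fraction is only approximately `p`. Weights are flat Python lists here rather than real tensors.

```python
import random

def dare(base, finetuned, p, seed=0):
    """Drop-and-rescale on the fine-tuning delta (flat lists of floats).

    Each changed weight is reset to its base value with probability p;
    surviving changes are rescaled by 1 / (1 - p).
    """
    rng = random.Random(seed)
    merged = []
    for b, f in zip(base, finetuned):
        delta = f - b
        if rng.random() < p:
            merged.append(b)                      # reset to base value
        else:
            merged.append(b + delta / (1.0 - p))  # rescale the kept change
    return merged

base = [0.0] * 1000
finetuned = [1.0] * 1000
sparse = dare(base, finetuned, p=0.5)
# Roughly half the deltas are dropped; the mean delta stays near 1.0 in expectation.
print(sum(w == 0.0 for w in sparse), sum(sparse) / len(sparse))
```

The rescaling keeps the expected delta unchanged, which is why the sparsified deltas from several fine-tuned models can then be combined with less interference.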


Editing is generally very similar to just adaptation/finetuning. One distinction is that it tends to try to keep changes localized, in an effort not to affect performance for most of the model.

  • Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs (zhang, singh, liu, liu, yu, gao, zhao, 2023) - upweight attention scores at specific positions to improve LLM controllability
  • Editing Large Language Models: Problems, Methods, and Opportunities (yao, …, zhang, 2023)
    • model-editing = data-efficient alterations to a model
  • memory-based
    • SERAC: Memory-Based Model Editing at Scale (mitchell…manning, finn, 2022)
      • keep track of list of edits in external memory and use them as appropriate context at test time (don’t finetune the model)
    • T-Patcher (Huang et al., 2023) and CaliNET (Dong et al., 2022) introduce extra trainable parameters into the feed-forward module of PLMs
  • weight updates
    • Knowledge Neurons in Pretrained Transformers (dai et al. 2021) - integrated gradients wrt each neuron in BERT, then selectively update these neurons
    • ROME: Locating and Editing Factual Associations in GPT (meng, bau et al. 2022)
      • localize factual associations - causal intervention for identifying neuron activations that are decisive in a model’s factual predictions
        • “causal traces” - run net multiple times, introducing corruptions and then restore states from original non-corrupted forward pass to see which states can restore the original results
        • a small number of states contain info that can flip the model from one state to another
      • change factual associations - modify feedforward weights to update specific factual associations using Rank-One Model Editing (ROME)
      • MEMIT: Mass Editing Memory in a Transformer (meng…, bau, 2022)
      • Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adapters (hartvigsen, …, palangi, …, ghassemi, 2023)
      • Flexible Model Interpretability through Natural Language Model Editing (d’oosterlinck, …, potts, 2023)
      • Model Editing with Canonical Examples (hewitt, …, liang, manning, 2024)
    • meta-learning
      • KnowledgeEditor: Editing Factual Knowledge in Language Models (de cao, aziz, & titov, 2021) - train a network that takes in input, output, edit and predicts a weight update to the model
      • MEND: Fast model editing at scale (mitchell…finn, manning, 2022)
        • a collection of small auxiliary editing networks that use a single desired input-output pair to edit a pre-trained model
        • MEND learns to transform the gradient obtained by standard fine-tuning, using a low-rank decomposition of the gradient
  • REMEDI (hernandez, li, & andreas, 2023) and related activation engineering
    • get “edit vectors” by obtaining embeddings when passing attributes through LLM
    • perform edit by adding a linear transformation of the edit vector to the prompt embedding
      • then, perform generation with latent embedding
      • learn linear transformation given a dataset of examples with attributes and desired completions
        • (also regularize the model to not change too much on other stuff)
    • Activation Addition: Steering Language Models Without Optimization (turner…macdiarmid, 2023)
      • blog post: activation engineering: Steering GPT-2-XL by adding an activation vector (turner, …, mini, 2023)
      • obtain “steering vector” by embedding a phrase (e.g. love) and adding that vector to the llm embedding during generation
        • they only add the embedding for some layers for some tokens
      • Extracting Latent Steering Vectors from Pretrained Language Models (subramani, …, peters, 2022) - find latent vectors via optimization that cause an LLM to output a particular sequence
        • then, use these vectors to do things like transfer to new tasks / compute textual similarity
      • Function Vectors in Large Language Models (todd…wallace, bau, 2023)
  • PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions (chen…sameer singh…kelvin guu, 2023)
  • new datasets
    • MQUAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions (zhong…manning, potts, chen, 2023) - introduces benchmark MQUAKE + method MeLLo, which stores edited facts externally while prompting the language model iteratively to generate answers that are consistent with the edited facts
    • COUNTERFACT+ benchmark - checks that edits don’t affect existing info
    • ALMANACS: A Simulatability Benchmark for Language Model Explainability
  • model unlearning approaches (see review Rethinking Machine Unlearning for Large Language Models, liu et al. 2024)
    • gradient ascent - worsen performance on set of examples to forget
    • gradient descent - improve performance on examples labeled with hidden info, e.g. response “I don’t know”
    • localization-informed unlearning, e.g. ROME
    • influence function-based methods
    • prompt-based (e.g. only change prompt rather than model parameters)
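
The weight-update methods above edit a feed-forward matrix so that a chosen key vector maps to a new value. A minimal sketch of the simplest rank-one version of that idea, assuming a single linear layer `W`, a key `k`, and a target value `v_new` (all names hypothetical). This is a simplification: ROME's actual update also accounts for statistics of other keys, which is omitted here.

```python
import numpy as np

def rank_one_edit(W, k, v_new):
    """Return W' = W + (v_new - W k) k^T / (k^T k).

    After the edit, W' @ k == v_new, while any input orthogonal
    to k is mapped exactly as before (the change is rank one).
    """
    residual = v_new - W @ k
    return W + np.outer(residual, k) / (k @ k)

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
k = rng.normal(size=4)
v_new = rng.normal(size=3)
W_edited = rank_one_edit(W, k, v_new)

# A direction orthogonal to k, to check the edit stays localized.
q = rng.normal(size=4)
q = q - (q @ k) / (k @ k) * k
```

The orthogonality check is a toy stand-in for the "don't affect existing info" requirement that benchmarks like COUNTERFACT+ test.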

direct weight inspection

  • overviews
    • Overview of mechanistic interpretability (nanda, 2022+)
    • review paper (rauker…hadfield-menell, 2023)
    • Representation engineering: A Top-Down Approach to AI Transparency (zou…kolter, hendrycks, 2023)
      • representation engineering (RepE) - analyzes representations/representation transformations rather than neurons or circuits
      • basically extends probing to more general tasks, including model control
  • Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors (yun, chen, olshausen, lecun, 2021) - investigate LLM embeddings of different words using dictionary learning
    • LLMs produce interesting contextualized word embeddings
    • dictionary elements (of activations across layers) correspond to meaningful things
    • dictionary element has size $d$, the embedding size
      • given list of sentences $S$, training matrix has size $\left(\underbrace{\text{num\_layers}}_{\text{12 for BERT}} \cdot \sum_{s \in S} \text{len}(s)\right) \times \underbrace{d}_{\text{768 for BERT}}$
    • dictionary coefficient: maps (text, layer, sequence_index) $\to$ coefficient
      • extract $d$-dimensional embedding for text at specified layer & sequence_index
  • Neuron-level Interpretation of Deep NLP Models: A Survey (sajjad et al. 2022)
    • previous works generally use pre-specified concepts, and focus on
      • concept search - given a neuron find its concept(s)
      • neuron search - given a concept find its matching neuron(s)
    • concept search
      • visualization, e.g. karpathy, johnson, fei-fei li, 2015 visualize LSTM head response in text
      • elicit top-k ngram responses on a corpus, which are then labelled manually (kadar et al. 2017)
      • elicit top-k activating sentences from a corpus, which are then summarized using a parse tree into a synthetic explanation (na…kim, 2019)
        • limitation: the explanation may be ungrammatical and biased towards something arbitrary (like repetition)
      • input maximization (e.g. textattack, poerner et al. 2018)
    • Evaluating Neuron Interpretation Methods of NLP Models (fan…sajjad, 2023) - metric is how well evaluation from one method matches the other ones
  • A Circuit for Indirect Object Identification in GPT-2 small (wang, …, steinhardt, 2022)
    • explanation encompasses 26 attention heads grouped into 7 main classes
    • task: indirect object identification - “When Mary and John went to the store, John gave a drink to ___” should be “Mary”
    • circuit
      • identify all previous names
      • remove duplicated names
      • output remaining name
    • Circuit Component Reuse Across Tasks in Transformer Language Models (merullo, eickhoff, & pavlick 2024) - find that the same circuit is used for 2 different tasks: IOI from above and Colored objects (from big-bench)
  • Interpretability at Scale: Identifying Causal Mechanisms in Alpaca (wu…, potts, goodman, 2023) - propose boundless DAS and automatically identify a circuit for math
  • N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models (foote, nanda, …, barez, 2023) - explain each neuron in a graph
  • Finding Skill Neurons in Pre-trained Transformer-based Language Models (wang et al. 2022) - some individual neurons are predictive of the final task (dubbed “skill neurons”)
  • thread (elhage…olah, 2021)
  • all layers are same dimension and each attention block adds a vector to it
  • Although they’re parameterized as separate matrices, $W_O W_V$ and $W_Q^T W_K$ can always be thought of as individual, low-rank matrices
    • $x \in \mathbb R^{d_{embed} \times d_{sequence}}$: $d_{embed}$ can be hundreds - tens of thousands
    • $W_Q, W_K, W_V \in \mathbb R^{d_{attn} \times d_{embed}}$
      • $W_Q^T W_K \in \mathbb R ^{d_{embed} \times d_{embed}}$
    • $W_O \in \mathbb R^{d_{embed} \times d_{attn}}$: projects attention values back to embedding dimension
      • $W_O W_V \in \mathbb R ^{d_{embed} \times d_{embed}}$
    • $W_E \in \mathbb R^{d_{embed} \times d_{vocab}}$ embeds initial tokens and $W_U \in \mathbb R^{d_{vocab} \times d_{embed}}$ undoes the embedding
      • $d_{vocab}$ can be very large, e.g. 50k
    • $A = \text{softmax}(x^T W_Q^T W_K x) \in \mathbb R^{d_{sequence} \times d_{sequence}}$
  • if we have a 0-layer net (e.g. predict next token with linear layer given current token), we just learn bigram log-likelihood
  • 2 circuits
    • QK circuit determines which “source” token the present “destination” token attends back to and copies information from
      • $W_{E}^{T} W_{Q}^{T} W_{K} W_{E} \in \mathbb R ^{d_{vocab} \times d_{vocab}}$
    • OV circuit describes what the resulting effect on the “out” predictions for the next token is
      • $W_{U} W_{O} W_{V} W_{E} \in \mathbb R ^{d_{vocab} \times d_{vocab}}$
  • if a single head increases the probability of both keep… in mind and keep… at bay, it must also increase the probability of keep… in bay and keep… at mind
  • induction heads search for previous examples of the present token
    • If they don’t find it, they attend to the first token and do nothing
    • if they do find it, they then look at the next token and copy it. This allows them to repeat previous sequences of tokens, both exactly and approximately
    • sometimes can do some kind of “fuzzy” matching
  • tensor/kronecker product $\bigotimes$:
    • Left-right multiplying: Multiplying $x$ by a tensor product $A \otimes W$ is equivalent to simultaneously left and right multiplying: $(A \otimes W) x=A x W^{T}$
    • When we add them, it is equivalent to adding the results of this multiplication: $\left(A_{1} \otimes W_{1}+A_{2} \otimes W_{2}\right) x=A_{1} x W_{1}^{T}+A_{2} x W_{2}^{T}$
  • Softmax Linear Units
  • replacing activation function with softmax linear unit increases fraction of MLP neurons which are “interpretable”, i.e. correspond to meaningful features
    • however, may “hide” some non-neuron-aligned features by decreasing their magnitude and then later recovering it with LayerNorm
  • the presence of nonlinear activation functions creates an incentive for features to align with this basis and not get superposed
    • if the gains to sparse coding are large enough, this incentive will get overwhelmed
  • ways to combat polysemanticity
    • activation sparsity
    • lateral inhibition / co-occurrence sparsity
    • weight sparsity
    • superlinear activation functions
    • increase neurons per param
  • $\text{SoLU}(x) = x \cdot \text{softmax}(x)$
    • adds lateral inhibition, superlinearity, approximate sparsity
    • changes GeLU, which is approximately $\text{sigmoid}(1.7x) \cdot x$
    • just changing to SoLU decreases performance, had to add LayerNorm afterwards
  • logit lens (2020) - apply unembedding matrix to outputs of each transformer layer
  • In-Context Language Learning: Architectures and Algorithms (akyurek…andreas, 2024) - find evidence for “n-gram heads”, higher-order variants of previously seen “induction heads”
  • A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention (cui…zdeborova, 2024) - solve 1-layer attention model for histogram task and find phase transition
  • Rosetta Neurons: Mining the Common Units in a Model Zoo (dravid, …, efros, shocher, 2023)
  • The Hydra Effect: Emergent Self-repair in Language Model Computations (mcgrath…legg, 2023) - ablations at one attention layer of an LLM cause another layer to compensate
  • Neurons in Large Language Models: Dead, N-gram, Positional (voita, ferrando, & nalmpantis, 2023)
  • Vision transformers need registers (darcet…mairal, bojanowski, 2023)
    • adding extra [reg1], [reg2] tokens that aren’t used at output improves vision transformer performance and attention map interpretability
    • without these tokens, attention maps are sometimes very noisy, particularly for uninformative tokens
  • Efficient Streaming Language Models with Attention Sinks (xiao…lewis, 2023)
  • Codebook Features: Sparse and Discrete Interpretability for Neural Networks (tamkin, taufeeque, & goodman, 2023)
  • Patchscope (ghandeharioun…geva, 2023) - decode LLM’s representation of a token by asking another copy of it to decode from that same representation
  • Program synthesis via mechanistic interpretability (michaud…tegmark) - condense RNN on simple algorithmic tasks into code
  • Linear Representations of Sentiment in Large Language Models (tigges…nanda, 2023) - sentiment is distributed across tokens (not just at sentiment-laden words)
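
The QK and OV circuit shapes in the transformer-circuits notes above can be checked numerically. A minimal sketch with random weights and toy sizes (all values are placeholders, not from a trained model), following the shape conventions in the notes: `W_E` is `d_embed x d_vocab`, `W_Q, W_K, W_V` are `d_attn x d_embed`, `W_O` is `d_embed x d_attn`, `W_U` is `d_vocab x d_embed`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_embed, d_attn = 50, 16, 4  # toy sizes, chosen arbitrarily

W_E = rng.normal(size=(d_embed, d_vocab))
W_Q = rng.normal(size=(d_attn, d_embed))
W_K = rng.normal(size=(d_attn, d_embed))
W_V = rng.normal(size=(d_attn, d_embed))
W_O = rng.normal(size=(d_embed, d_attn))
W_U = rng.normal(size=(d_vocab, d_embed))

# QK circuit: token-to-token attention scores (before softmax).
qk_circuit = W_E.T @ W_Q.T @ W_K @ W_E   # (d_vocab, d_vocab)
# OV circuit: effect of an attended-to token on the output logits.
ov_circuit = W_U @ W_O @ W_V @ W_E       # (d_vocab, d_vocab)

# Both are huge vocab x vocab matrices but have rank at most d_attn.
qk_rank = np.linalg.matrix_rank(qk_circuit, tol=1e-8)
ov_rank = np.linalg.matrix_rank(ov_circuit, tol=1e-8)

# Pre-softmax attention scores for a sequence are a sub-block of the QK circuit.
tokens = np.array([3, 7, 7, 12])
x = W_E[:, tokens]                        # (d_embed, d_sequence)
scores = x.T @ W_Q.T @ W_K @ x            # (d_sequence, d_sequence)
```

The low rank is what makes it reasonable to treat $W_Q^T W_K$ and $W_O W_V$ as single low-rank matrices even though they are parameterized separately. Positional information and causal masking are left out of this sketch.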

debugging / interpretation

  • TalkToModel: Understanding Machine Learning Models With Open Ended Dialogues (slack…lakkaraju, sameer singh, 2022) - natural language interface to query model (by converting to commands such as filtering the data / calculating importance)
    • Rethinking Explainability as a Dialogue: A Practitioner’s Perspective (lakkaraju, slack, …, sameer singh, 2022) - interviews with high-stakes users suggest they would like to be able to interact with systems via dialog
  • AdaTest: Adaptive Testing and Debugging of NLP Models (ribeiro & lundberg, 2022)
    • goal: easily specify, discover, and fix undesirable behaviors in an NLP model
    • 2-step iterative algorithm
      1. LLM generates many tests targeting the model’s failures

        • example of a test: f(“I am a black woman”) ≠ neg

        • user selects and organizes the tests and reprompts the LLM to find more

      2. User fixes the tests (e.g. via finetuning)

    • CheckList - Beyond Accuracy: Behavioral Testing of NLP models with CheckList (ribeiro…sameer singh, 2020)
      • matrix of general linguistic capabilities + test types
  • Fixing Model Bugs with Natural Language Patches (murty, manning, lundberg, & ribeiro 2022)
    • specify patches with natural language rather than hard rule, allowing them to better handle text
    • finetune a model to combine original model output with output from a patch-conditioned interpreter head
  • interpretable models
    • Aug-imodels: Augmenting Interpretable Models with LLMs during Training (singh, askari, caruana, & gao, 2023)
    • Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens (liu, min, zettlemoyer, choi, & hajishirzi, 2024)
      • motivation: hard to scale ngram models to large datasets and large data lengths

      • soln 1: backoff (Jurafsky & Martin, 2000) - select n based on the longest suffix of the prompt that has a non-zero count in the corpus
        • counts of the next token yield the prob. of the next token
        • Katz backoff (Katz, 1987) discounts probs to yield valid prob. distr.
      • soln 2: represent prob. table in suffix array to make things very fast
        • suffix array stores address to each location in the training data alphabetically sorted
          • roughly the same size
        • this makes it fast to search for instances of an ngram (and also for what precedes/follows it)
      • results show that infinigram can considerably improve perplexities when it is linearly combined with the logits from LLMs (experiments up to llama-2 70B)
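
The backoff rule described for Infini-gram above (use the longest prompt suffix that appears in the corpus, then read off next-token counts) can be sketched in a few lines. This is an illustrative brute-force scan over a token list, assuming whitespace tokens; it does not use a suffix array, which is what makes the real system fast, and the function name is hypothetical.

```python
from collections import Counter

def backoff_next_token_probs(corpus, prompt):
    """Return (matched_suffix, next-token distribution).

    Tries the full prompt first, then progressively shorter suffixes,
    stopping at the longest one that occurs in the corpus with a
    following token. The empty suffix falls back to unigram counts.
    """
    for start in range(len(prompt) + 1):
        suffix = prompt[start:]
        n = len(suffix)
        counts = Counter(
            corpus[i + n]
            for i in range(len(corpus) - n)
            if corpus[i:i + n] == suffix
        )
        if counts:
            total = sum(counts.values())
            return suffix, {tok: c / total for tok, c in counts.items()}
    return [], {}

# Toy usage.
corpus = "the cat sat on the mat the cat ran".split()
prompt = "a dog saw the cat".split()
suffix, probs = backoff_next_token_probs(corpus, prompt)
```

Here the longest matching suffix is `["the", "cat"]`, which is followed once by `sat` and once by `ran`, so each gets probability 0.5.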


  • Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT) (sabir, babar, & abuadbba, 2023)
    • leverages techniques such as attention maps, integrated gradients, and model feedback to detect and then change adversarial inputs
  • datasets: harmbench & trustllm
  • attacks from TextAttack:
| Attack Recipe Name | Goal Function | Constraints Enforced | Transformation | Search Method | Main Idea (reference) |
| --- | --- | --- | --- | --- | --- |
| a2t | Untargeted {Classification, Entailment} | Percentage of words perturbed, Word embedding distance, DistilBERT sentence encoding cosine similarity, part-of-speech consistency | Counter-fitted word embedding swap (or) BERT Masked Token Prediction | Greedy-WIR (gradient) | Yoo et al., 2021 |
| alzantot | Untargeted {Classification, Entailment} | Percentage of words perturbed, Language Model perplexity, Word embedding distance | Counter-fitted word embedding swap | Genetic Algorithm | Alzantot et al., 2018 |
| bae | Untargeted Classification | USE sentence encoding cosine similarity | BERT Masked Token Prediction | Greedy-WIR | Garg & Ramakrishnan, 2019 |
| bert-attack | Untargeted Classification | USE sentence encoding cosine similarity, Maximum number of words perturbed | BERT Masked Token Prediction (with subword expansion) | Greedy-WIR | Li et al., 2020 |
| checklist | {Untargeted, Targeted} Classification | checklist distance | contract, extend, and substitutes name entities | Greedy-WIR | Ribeiro et al., 2020 |
| clare | Untargeted {Classification, Entailment} | USE sentence encoding cosine similarity | RoBERTa Masked Prediction for token swap, insert and merge | Greedy | Li et al., 2020 |
| deepwordbug | {Untargeted, Targeted} Classification | Levenshtein edit distance | {Character Insertion, Character Deletion, Neighboring Character Swap, Character Substitution} | Greedy-WIR | Gao et al., 2018 |
| fast-alzantot | Untargeted {Classification, Entailment} | Percentage of words perturbed, Language Model perplexity, Word embedding distance | Counter-fitted word embedding swap | Genetic Algorithm | Jia et al., 2019 |
| hotflip | Untargeted Classification | Word Embedding Cosine Similarity, Part-of-speech match, Number of words perturbed | Gradient-Based Word Swap | Beam search | Ebrahimi et al., 2017 |
| iga | Untargeted {Classification, Entailment} | Percentage of words perturbed, Word embedding distance | Counter-fitted word embedding swap | Genetic Algorithm | Wang et al., 2019 |
| input-reduction | Input Reduction | | Word deletion | Greedy-WIR | Feng et al., 2018 |
| kuleshov | Untargeted Classification | Thought vector encoding cosine similarity, Language model similarity probability | Counter-fitted word embedding swap | Greedy word swap | Kuleshov et al., 2018 |
| pruthi | Untargeted Classification | Minimum word length, Maximum number of words perturbed | {Neighboring Character Swap, Character Deletion, Character Insertion, Keyboard-Based Character Swap} | Greedy search | Pruthi et al., 2019 |
| pso | Untargeted Classification | | HowNet Word Swap | Particle Swarm Optimization | Zang et al., 2020 |
| pwws | Untargeted Classification | | WordNet-based synonym swap | Greedy-WIR (saliency) | Ren et al., 2019 |
| textbugger | Untargeted Classification | USE sentence encoding cosine similarity | {Character Insertion, Character Deletion, Neighboring Character Swap, Character Substitution} | Greedy-WIR | Li et al., 2018 |
| textfooler | Untargeted {Classification, Entailment} | Word Embedding Distance, Part-of-speech match, USE sentence encoding cosine similarity | Counter-fitted word embedding swap | Greedy-WIR | Jin et al., 2019 |

architecture/attention variants

  • state space models (good overview in albert gu thesis, 2023)
    • S4: structured state space models (gu…re, 2022) - similar to RNNs but can predict all outputs at once via convolution
      • the core of the state space model is basically a linear RNN
        • inputs x, hidden states h, outputs y
        • 3 matrices: $A, B, C$
        • $y_i = C h_i$
        • $h_i = A h_{i-1} + B x_i$
          • note: there is no nonlinearity between hidden states
          • note: the transition from one hidden state to the next is the same for all positions (except for the input)
        • can compute hidden states simultaneously by just pre-multiplying these A and B matrices with x the right number of times (a convolution operation)
    • mamba: selective state space models (gu & dao, 2023)
      • changes the second note above: the transition from one hidden state to the next now depends on the input (making it closer to LSTMs)
        • $B = B(x)$
        • $C = C(x)$
    • Tree Transformer: Integrating Tree Structures into Self-Attention (wang, .., chen, 2019)
    • Waveformer: Linear-Time Attention with Forward and Backward Wavelet Transform (zhuang…shang, 2022)
  • White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? (yaodong yu…yi ma, 2023)
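
The claim in the S4 notes above, that a linear recurrence $h_i = A h_{i-1} + B x_i$, $y_i = C h_i$ can be computed all at once as a convolution, can be checked directly. A minimal sketch, assuming scalar inputs and outputs, a zero initial state, and small random matrices (these choices are assumptions for illustration; S4 itself uses structured $A$ and a discretization step not shown here):

```python
import numpy as np

def ssm_recurrent(A, B, C, x):
    """Run the linear RNN step by step: h_i = A h_{i-1} + B x_i, y_i = C h_i."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_i in x:
        h = A @ h + B * x_i
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolutional(A, B, C, x):
    """Same outputs via a causal convolution with kernel K_j = C A^j B."""
    L = len(x)
    kernel = np.empty(L)
    A_power_B = B.copy()              # holds A^j B, starting at j = 0
    for j in range(L):
        kernel[j] = C @ A_power_B
        A_power_B = A @ A_power_B
    # Full convolution, truncated to the first L (causal) outputs.
    return np.convolve(x, kernel)[:L]

# Toy usage.
rng = np.random.default_rng(0)
A = 0.3 * rng.normal(size=(4, 4))
B = rng.normal(size=4)
C = rng.normal(size=4)
x = rng.normal(size=10)
y_rec = ssm_recurrent(A, B, C, x)
y_conv = ssm_convolutional(A, B, C, x)
```

The equivalence relies on there being no nonlinearity between hidden states and on $A, B, C$ being the same at every position. Once $B$ and $C$ depend on the input, as in mamba, a single fixed kernel no longer exists.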

mixture of experts (MoE) / routing

mixture of experts models have become popular because of the need for (1) fast speed / low memory at test time while still (2) having a large model during training

  • note: nowadays often the “experts” are different MLPs following the self-attention layers
  • A Review of Sparse Expert Models in Deep Learning (fedus, jeff dean, zoph, 2022)
    • sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models
    • routing algorithm - determines where to send examples
      • discreteness makes it difficult
        • some works use RL to learn routing
        • standard approach uses gumbel-softmax
        • usually get matrix of similarities between input tokens and experts and route based on these
          • sometimes route to topk experts rather than top1
      • load balancing - usually add an auxiliary loss to encourage equal tokens being sent to different experts
  • non-specialized experts
  • routing notes - make hard decision but still want to learn probabilities
    • straight-through estimator (STE) - take the argmax during the forward pass, while considering the original probabilities in the backward pass
      • highly biased
    • gumbel-softmax - allows for better sampling
  • specialized experts as fully independent models (sometimes for multi-task learning)
  • Towards Understanding Mixture of Experts in Deep Learning (chen…gu, li, 2022)
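
The routing description above (similarity matrix between tokens and experts, top-k selection, auxiliary load-balancing loss) can be sketched as follows. The notes do not give a formula for the auxiliary loss; the form used here, `n_experts * sum_e(fraction_routed_e * mean_prob_e)`, is an assumption in the style of Switch Transformer, and all function names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route_top_k(tokens, router_weights, k=1):
    """Score every (token, expert) pair and keep the top-k experts per token."""
    logits = tokens @ router_weights                   # (n_tokens, n_experts)
    probs = softmax(logits)
    top_k = np.argsort(-probs, axis=-1)[:, :k]         # chosen expert indices
    gates = np.take_along_axis(probs, top_k, axis=-1)
    gates = gates / gates.sum(axis=-1, keepdims=True)  # renormalize over chosen experts
    return probs, top_k, gates

def load_balancing_loss(probs, top_k):
    """Assumed auxiliary loss: equals 1.0 when routing is perfectly balanced."""
    n_experts = probs.shape[1]
    dispatch = np.zeros_like(probs)
    np.put_along_axis(dispatch, top_k, 1.0, axis=-1)
    frac_tokens = dispatch.sum(axis=0) / dispatch.sum()  # share of assignments per expert
    mean_prob = probs.mean(axis=0)                       # average router prob per expert
    return n_experts * float(frac_tokens @ mean_prob)

# Toy usage with random token representations.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
router_weights = rng.normal(size=(8, 4))
probs, top_k, gates = route_top_k(tokens, router_weights, k=2)
aux = load_balancing_loss(probs, top_k)
```

The hard `argsort` selection is the non-differentiable step that the straight-through estimator or gumbel-softmax is meant to work around.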

symbolic reasoning

See also notes on 📌 comp neuro.

  • Compositional processing emerges in neural networks solving math problems (russin, roland fernandez, …, smolensky, gao, 2021)
  • Modular Deep Learning (pfeiffer, ruder, .., ponti, 2023) - overview of different modular architectures
  • neurocompositional computing (smolensky…gao, 2022)
    • longer tutorial (smolensky, …, gao, 2022)

    • central paradox of cognition is that the brain uses continuous neural symbols yet is compositional (smolensky et al. 1992)
      • Compositionality
      • Continuity - the encoding and processing of information is formalized with real numbers that vary continuously
    • 3 challenges: compositional generalization, data efficiency, comprehensibility
    • solution - NECST: Neurally-Encoded Compositionally-Structured Tensor computing (smolensky & legendre, 2006) - basically leverages TPR
      • TPR roles and fillers can both be made continuous
    • neural space vs symbolic space (many different things (e.g. sentences) can mean the same thing) - word vectors can be thought of as “soft symbols”
    • want to move from symbolic repr. to neural repr. while keeping interpretability
      • system should output intermediate steps in addition to answer
      • thinking fast (system 1: fast, intuitive) + slow (system 2: slower, logical, deliberative)
    • concrete proposal: transformer activation vector should encode graph of flow through the network
      • ex. task: regurgitate a sequence
  • NECSTransformer: Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving (schlag, smolensky, …, schmidhuber, gao, 2019)
    • TP-attention
    • beats state of the art on free-form math word-problems
    • in addition to K, Q, V, also add a role-vector
      • do element-wise multiplication of outputted vector with role-vector
    • TPR built as outer product of 2 vectors:
      • filler - the vector returned by attention
        • ex. one head learns “second-argument-of”
      • role - a relation conceptually labeling an edge of the attention graph
  • TP-N2F: Tensor Product Representation for Natural To Formal Language Generation - Microsoft Research (chen…gao, 2019)
  • Logical Transformers: Infusing Logical Structures into Pre-Trained Language Models (wang, huang, …, gao, 2023) - use logical model to alter embeddings before feeding to LLM
  • Implicit Chain of Thought Reasoning via Knowledge Distillation (deng…smolensky…, 2023)
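
The tensor product representation (TPR) mentioned above binds each filler to a role with an outer product and sums the results. A minimal sketch of the classic construction, assuming orthonormal role vectors so that unbinding is exact (the dimensions and names are illustrative; this is not the TP-attention layer itself, which the notes describe as using element-wise multiplication with a role vector):

```python
import numpy as np

def bind(fillers, roles):
    """TPR: sum over outer products filler_i (x) role_i."""
    return sum(np.outer(f, r) for f, r in zip(fillers, roles))

def unbind(tpr, role):
    """Recover the filler bound to `role` (exact when roles are orthonormal)."""
    return tpr @ role

# Toy usage: 3 fillers of dimension 5 bound to 3 one-hot roles.
rng = np.random.default_rng(0)
fillers = rng.normal(size=(3, 5))
roles = np.eye(3)
tpr = bind(fillers, roles)            # shape (5, 3)
recovered = unbind(tpr, roles[1])
```

Because both fillers and roles are real-valued vectors, the same structure stays differentiable, which is the sense in which roles and fillers "can both be made continuous".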

embeddings / retrieval-augmented generation

  • detailed overview of info retrieval (bruch, 2024)
  • introductory blog post on embeddings
  • basic training pipeline
    1. standard self-supervised pre-training, e.g. BERT
    2. weak unsupervised pre-training, e.g. weakly related text pairs, such as QA pairs from forums like StackExchange and Quora
    3. high-quality contrastive finetuning on curated paired data, e.g. QA from web searches
  • datasets
    • MTEB leaderboard
    • Instructor eval
      • Billboard
      • Prompt retrieval
    • Long contexts
    • Older
    • Training
      • Nomic 235M curated text pairs (mostly filtered from here)
        • Followed by supervised contrastive fine-tuning on datasets like MSMarco, NQ, NLI, HotpotQA, Fever, WikiAnswers, etc.
      • MEDI (from Instructor paper): combines 300 datasets from Super-NaturalInstructions with 30 datasets from existing collections designed for embedding training
  • customization
    • e.g. add prompt or prefixes like search query, search document, classification, clustering before embedding so model knows how to match things
  • top-performing models
  • embedding approaches overview
    • 3 levels of interaction
      • bi-encoder: separately encode query & doc
      • cross-encoder: encode query and doc together
      • late-interaction encoder: hybrid, separately encode, but then learn some params on how to compute similarity between them (e.g. ColBERT)
    • expansion & reweighting (e.g. doc2query)
    • sparse representation learning (e.g. UHD-BERT (jang…seo, 2021))
    • joint learning with index
    • prior work: query expansion, term dependency model (e.g. tf-idf), topic model, translation model
  • query expansion
  • embedding search monograph (bruch, 2024)
  • Active Retrieval Augmented Generation (jiang…neubig, 2023) - introduce FLARE, a method that iteratively uses a prediction of the upcoming sentence to anticipate future content, which is then utilized as a query to retrieve relevant documents to regenerate the sentence if it contains low-confidence tokens
  • Matryoshka Representation Learning (kusupati…kakade, jain, & farhadi, 2022) - in training given an embedding of full dimensionality M (e.g. 2048), learn N different distance functions for each prefix of the embedding (e.g. l2_norm(embedding[:32]), l2_norm(embedding[:64]), l2_norm(embedding[:128]), etc).
  • Hypothetical Document Embeddings (gao…callan, 2022) - generate hypothetical document from query + instruction using GPT and find match for that doc
  • Probing embeddings
  • RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval (sarthi…manning) - retrieve many docs and cluster/summarize before using
  • Seven Failure Points When Engineering a Retrieval Augmented Generation System (barnet…abdelrazek, 2024)
  • Retrieve to Explain: Evidence-driven Predictions with Language Models (patel…corneil, 2024)
  • Explaining embeddings
    • Computer-vision focused
      • Axiomatic Explanations for Visual Search, Retrieval, and Similarity Learning (hamilton, lundberg…freeman, 2021) - add in “second-order” methods that look at similarities between different image features in the 2 images being compared
      • Why do These Match? Explaining the Behavior of Image Similarity Models (plummer…saenko, forsyth, 2020) - generate saliency map paired with an attribute based on the salient region
      • Towards Visually Explaining Similarity Models (zheng…wu, 2020) - similarity of cnn embeddings
    • Interpretable entity representations through large-scale typing (onoe & durrett, 2020) - embedding is interpretable predictions for different entities
  • Explaining similarity with different outputs
    • Analogies and Feature Attributions for Model Agnostic Explanation of Similarity Learners (ramamurthy…tariq, 2022) - returned explanation is an analogy (pair from the training set) rather than a saliency map
    • Sim2Word: Explaining Similarity with Representative Attribute Words via Counterfactual Explanations (chen…cao, 2023) - give both saliency map + counterfactual explanation
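
For the Matryoshka note above, the practical payoff is that a prefix of the embedding can be used on its own at retrieval time. A minimal sketch of scoring with a truncated, renormalized prefix in a bi-encoder setup; the vectors below are hand-picked placeholders rather than outputs of a trained model, and `prefix_cosine` is a hypothetical helper, not part of any library:

```python
import numpy as np

def prefix_cosine(query, docs, dim):
    """Cosine similarity using only the first `dim` embedding coordinates."""
    q = query[:dim]
    d = docs[:, :dim]
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    return d @ q

# Toy usage: one query, two documents, embedding size 4.
query = np.array([1.0, 0.0, 0.0, 1.0])
docs = np.array([
    [1.0, 0.0, 0.0, -1.0],
    [0.0, 1.0, 1.0, 0.0],
])
coarse = prefix_cosine(query, docs, dim=2)  # cheap score on the 2-d prefix
full = prefix_cosine(query, docs, dim=4)    # score on the full embedding
```

In this hand-made example the prefix and full scores disagree, which is the failure mode Matryoshka-style training is meant to avoid: the per-prefix losses push short prefixes to rank documents consistently with the full vector.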


  • SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot (frantar & alistarh, 2023) - prune GPT-style models to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy
  • Cramming: Training a Language Model on a Single GPU in One Day (geiping & goldstein, 2022) - tricks for training BERT


dataset / module explanation

  • Rethinking Interpretability in the Era of Large Language Models (singh et al. 2024) - review emphasizing emerging areas like dataset explanation
  • dataset explanation
    • iPrompt: Explaining Patterns in Data with Language Models via Interpretable Autoprompting (singh, morris, …gao, 2022) - prompting approach
    • Instruction Induction: From Few Examples to Natural Language Task Descriptions (honovich…bowman, levy 2022) - directly query model with prompt to search for task description
    • D3: Describing Differences between Text Distributions with Natural Language (zhong, snell, klein, & steinhardt, 2022) - finetune an LLM to directly describe difference between 2 text distrs
      • D5: Goal Driven Discovery of Distributional Differences via Language Descriptions (zhong, zhang, …, klein, & steinhardt, 2023) - add dataset-specific prompt + evaluation on larger set of 675 datasets
      • technically this is just learning a classifier, where the classifier is a natural-language string
      • method
        • proposer network generates hypotheses
        • verifier network looks at all samples in the dataset (since proposer couldn’t fit them all in context) and returns how accurate the hypotheses were
        • some tricks
          • select samples which are “representative” of a class by predicting with another LLM
          • have a pool of 302 manual hypotheses they use for seeding
      • Goal-Driven Explainable Clustering via Language Descriptions (wang…, zhong, 2023)
      • Mass-Producing Failures of Multimodal Systems with Language Models (tong, jones, & steinhardt, 2023)
      • TopicGPT: A Prompt-based Topic Modeling Framework (pham…iyyer, 2023)
    • GSCLIP : A Framework for Explaining Distribution Shifts in Natural Language (zhu…james zou, 2022) - automatically explain dataset-level distribution shifts (in image datasets) with natural language
    • MaNtLE: Model-agnostic Natural Language Explainer (menon, zaman, & srivastava, 2023) - train model to generate explanations on simple tables (they do this for classifier outputs but could easily do it directly for data labels)
    • Large Language Models for Automated Open-domain Scientific Hypotheses Discovery (yang…cambria, 2023)
    • Scaling deep learning for materials discovery (merchant…cubuk, 2023)
  • module explanation in natural language
    • Explaining black box text modules in natural language with language models (singh, hsu, …, gao, 2023)
    • Language models can explain neurons in language models (bills, cammarata, …saunders, 2023, openai)
      • goal: explain a neuron
        • step 1: summarize (token, activation) pairs into an explanation
        • step 2: create simulated neuron that outputs activations given tokens
        • step 3: check correlation of simulated neuron outputs with real neuron outputs
      • their unigram baseline summarizes top unigrams into a string
      • they use synthetic generated data to revise the explanation
      • they also do some recovery tests on “neuron puzzles”
      • The Importance of Prompt Tuning for Automated Neuron Explanations (lee…weng, 2023)
    • MILAN: Natural Language Descriptions of Deep Visual Features (hernandez…david bau…torralba, andreas, 2022) - given a neuron, generates a natural-language string that maximizes pointwise mutual information with the image regions in which the neuron is active
      • Scale Alone Does not Improve Mechanistic Interpretability in Vision Models (zimmermann, klein, & brendel, 2023) - perform human eval of interpretability of different units (show human top-activating patches and ask them to decide which of 2 patches will be top-activating)
    • Evaluation
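
A minimal sketch of step 3 from the neuron-explanation recipe above (bills et al.): score an explanation by correlating simulated activations with real ones. The activation values and the `pearson` helper are made up for illustration; the paper's actual scoring pipeline is more involved.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of floats."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical per-token activations of one real neuron on one snippet.
real_acts = [0.0, 0.1, 2.3, 0.0, 1.9, 0.2]
# Hypothetical activations a simulator predicts from the explanation alone.
simulated_acts = [0.0, 0.0, 2.0, 0.0, 2.0, 0.0]

score = pearson(real_acts, simulated_acts)
print(f"explanation score (correlation): {score:.3f}")
```

A score near 1 means the explanation lets the simulator reproduce the neuron's behavior; a score near 0 means it captures little.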

directly learning algorithms / in-context

  • Empirical results
    • FunSearch: Mathematical discoveries from program search with large language models (deepmind, 2023)
    • Faster sorting algorithms discovered using deep reinforcement learning (deepmind, 2023)
    • Discovering faster matrix multiplication algorithms with reinforcement learning (deepmind, 2022)
    • Nuclear fusion control (deepmind, 2022)
  • Alphafold
  • What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (garg, tsipras, liang, & valiant, 2022) - models can successfully metalearn functions like OLS
    • e.g. during training, learn inputs-outputs from different linear functions
    • during testing, have to predict outputs for inputs from a different linear function
    • also test on slightly harder functions, like decision trees and 2-layer nets
    • Decision tree (zhuang…gao, 2024) - transformer can learn to algorithmically interpolate between CART and GOSDT
    • What Algorithms can Transformers Learn? A Study in Length Generalization (zhou…bengio, nakkiran, 2023) - Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths
    • Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions (bhattamishra…varun kanade, 2023) - on boolean functions, transformers can learn to match optimal algorithms for simple tasks but not on complex tasks
      • Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples
    • Limits of Transformer Language Models on Learning Algorithmic Compositions (thomm…scholkopf, rahimi, 2024)
    • Dissecting Chain-of-Thought: Compositionality through In-Context Filtering and Learning (li…papailiopoulos, oymak, 2023) - CoT helps LLMs learn MLP compositional functions in-context
  • Learning a (sparse) linear model
  • What learning algorithm is in-context learning? Investigations with linear models (akyurek, schuurmans, andreas, ma, & zhou, 2023) - investigate prompting through synthetic experiments with transformers trained for linear regression
    • Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning (li, …, oymak, 2023) - generalization bounds for in-context learning when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system
    • Trained Transformers Learn Linear Models In-Context (zhang, frei, & bartlett, 2023)
    • One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention (mahankali, hashimoto, & ma, 2023)
      • math analysis for: icl can do gradient descent on linear regression
    • Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression (raventos…ganguli, 2023)
  • Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models (fu…sharan, 2023)
  • Teaching Algorithmic Reasoning via In-context Learning (zhou…sedghi, 2022)
  • Looped Transformers as Programmable Computers (giannou, …, jason lee, papailiopoulos, 2023) - use transformers as universal computers by programming them with specific weights
  • Learning mathematical problems (francois charton)
  • Theory (don’t directly predict algorithm)
    • Meta-learning for Mixed Linear Regression (kong…kakade, oh, 2020) - generalization for linear regression based on which linear tasks were seen before
  • Limitations
    • Faith and Fate: Limits of Transformers on Compositionality (dziri…choi, 2023) - LLMs can’t (easily) be trained well for multiplication (and similar tasks)
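
A rough sketch of the intuition behind the "icl can do gradient descent on linear regression" notes above: one gradient step from w = 0 on the in-context examples gives the same prediction as an un-normalized linear attention readout (keys x_i, values y_i, query x_query). The function names, learning rate, and synthetic data are assumptions for illustration, not the construction from any of the papers.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict y for x_query after one gradient step from w = 0."""
    n, d = len(xs), len(x_query)
    # Gradient of (1/2n) * sum_i (w.x_i - y_i)^2 at w = 0 is -(1/n) * sum_i y_i * x_i,
    # so one step gives w = (lr / n) * sum_i y_i * x_i.
    w = [lr / n * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
    return dot(w, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Same prediction as attention without softmax: sum_i (x_i . x_query) * y_i."""
    n = len(xs)
    return lr / n * sum(dot(x, x_query) * y for x, y in zip(xs, ys))

rng = random.Random(0)
w_true = [1.5, -2.0]  # hypothetical ground-truth linear function
xs = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(200)]
ys = [dot(w_true, x) for x in xs]
x_query = [0.5, 1.0]

pred_gd = gd_one_step_prediction(xs, ys, x_query, lr=1.0)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=1.0)
print(pred_gd, pred_attn)  # the two agree up to floating-point error
```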

cool tasks

  • Forecasting Future World Events with Neural Networks (zou…hendrycks, 2022) - takes tasks from metaculus
  • Shortcut Learning of Large Language Models in Natural Language Understanding: A Survey (du et al. 2022)
  • Neurosymbolic Programming for Science (sun…costilla-reyes, 2022)
  • Discovering New Interpretable Conservation Laws as Sparse Invariants (liu…tegmark, 2023) - does not use transformers
  • evaluation without groundtruth
  • Learning from learning machines: a new generation of AI technology to meet the needs of science (berkeley+lbnl+, 2021)

    • do more than predict what will happen, they attempt to offer insight into how or why
    • AI-based language models powering drug discovery and development (2021)
    • BioTranslator: Multilingual translation for zero-shot biomedical classification (xu, woicik, poon, altman, & wang, 2023) - takes a user-written textual description of a new concept and then translates this description to a non-text biological data instance
      • results for biological data, e.g. genes, proteins
      • enables the identification of novel cell types using only a textual description
  • Learning to Generate Novel Scientific Directions with Contextualized Literature-based Discovery (wang…hope, 2023)
    • literature-based discovery (swanson, 1986) - focus on predicting pairwise links between concepts from papers (e.g. drug-disease links)
      • task 1: idea-sentence generation – given sentences describing background context + a seed term, generate a sentence describing an idea
      • task 2: idea-node prediction – given the background context, predict new links between existing concepts (and generate new concepts)
    • forecasting paper titles (blog post)
  • Communication with animals

    • Coller-Dolittle Prize for Inter-species Communication
    • Cetacean Translation Initiative: a roadmap to deciphering the communication of sperm whales (andreas, begus, …, wood, 2021)
      • sperm whale has largest brain
      • ML outputs are primarily a tool to constrain hypothesis space to build formal and interpretable descriptions of the sperm whale communication
    • A Theory of Unsupervised Translation Motivated by Understanding Animal Communication (goldwasser…paradise, 2023)
    • Approaching an unknown communication system by latent space exploration and causal inference (begus, leban, & gero, 2023) - manipulate GAN latent variables in approach called causal disentanglement with extreme values (CDEV)
    • Vowels and Diphthongs in Sperm Whales (begus, sprouse, leban, & gero, 2023) - use data from the dominica sperm whale project (gero et al. 2014)
  • scientific organization (galactica)
    • related but smaller models
    • all data is processed in a common markdown format
    • task-specific tokens to support different types of knowledge (e.g. citations, step-by-step reasoning, different modalities, e.g. proteins)
    • chemical compounds (train on 2 mil / 110 mil from PubChem Compound, authors still want it to focus on text)
      • predict IUPAC name from SMILES formula e.g. CC(C)(C)C(=O)N(CC1=NC(=CS1)C(=O)OC)C2CCCCC2 -> methyl 2-[[cyclohexyl-(2,2-dimethylpropanoyl)]amino] methyl]thiazole-4-

      • moleculenet (wu et al. 2017) classification benchmark (6 tasks)

        • training set examples are trained as text during fitting

          • HIV - classify whether compound inhibits HIV replication
          • BACE C - binding results (classification + regression) for BACE
          • BBBP - blood-brain barrier penetration (permeability) (binary classification)
          • Tox21 - qualitative toxicity on 12 targets (12-class multilabel binary)
          • SIDER - 27-class multi-class disorders in different organ systems
          • ClinTox - binary toxicity classification
        • ex. for BBBP (one of the 6 tasks) - question is posed in different ways during training

          Here is a SMILES formula:   
          Question: Will the chemical compound penetrate the blood-brain barrier?
          Answer: No
    • protein sequences
      • from 227 million in UniProt, look at only 0.5 million subset (called Swiss-Prot)
      • evaluate protein sequence perplexity
      • protein keyword prediction (predict keywords in UniProt, like “ATP-Binding”, “Cell membrane”)
      • protein function description - compare free-form description to GT UniProt function description

automated assistants / HITL

  • similar to causality, we may want to use interpretability just to understand our data rather than to get any form of model

  • Benchmarking Large Language Models As AI Research Agents (huang, vora, liang, & leskovec, 2023) - formulate concrete ml tasks (like improve accuracy on a kaggle task) and see how well LLMs can do at them

  • GATE: Eliciting Human Preferences with Language Models (li, tamkin, goodman, & andreas, 2023) - LMs guide the task specification process (e.g. content recommendation), which is both free-form and interactive

  • visualization

  • modeling

    • TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations (slack, krishna, lakkaraju, & singh, 2023) - train model to translate human queries into API calls (~30 calls, things like feature importance, filter data, counterfactual explanation)

    • TalkToEBM: LLMs Understand Glass-Box Models, Discover Surprises, and Suggest Repairs (lengerich…caruana, 2023) - use LLMs to analyze tabular data and make suggestions for EBMs
      • GAM Changer: Editing Generalized Additive Models with Interactive Visualization (wang…caruana, 2021) - gui for editing GAMs
    • Tisane: Authoring Statistical Models via Formal Reasoning from Conceptual and Data Relationships (jun, seo, heer, & just, 2022) - language to better specify assumptions when fitting GLMs / GLMMs
    • LLMs for Semi-Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering (hollmann, muller & hutter, 2023)

tabular data

  • neurips 2023 tabular workshop and review from feb 4 2024

  • value string methods - directly treating numerical values as strings and finetuning GPT on them (everything is represented as text)
  • do not use text tokens
  • jointly encode table with text prompt / text in the table
    • TP-BERTa: Making Pre-trained Language Models Great on Tabular Prediction (2023)
      • adds relative magnitude tokenization - converts scalar numerical feature values to discrete tokens (discretization requires a label)
      • intra-feature attention approach integrates feature values with the corresponding feature names
    • UniPredict: Large Language Models are Universal Tabular Predictors (wang, wang, & sun, 2023) - use text and prompt descriptions
    • Trompt: Towards a Better Deep Neural Network for Tabular Data (chen…chang, 2023) - use a prompting-style approach
    • TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data (yin, neubig, …, riedel, 2020)
  • classification / predictions
    • TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second (hollmann, …, hutter, 2022)
      • transformer takes in train + test dataset then outputs predictions
      • each row (data example) is treated as a token and test points attend only to training tokens
        • takes fixed-size 100 columns, with zero-padded columns at the end (during training, randomly subsample columns)
      • builds on prior-data fitted networks (PFNs) (muller, …, hutter, 2021)
      • trained on synthetic data
    • TabR: Unlocking the power of retrieval-augmented tabular deep learning (gorishniy…babenko, 2023)
    • TabLLM: Few-shot Classification of Tabular Data with Large Language Models (hegselmann…, sontag, 2022)
    • Language models are weak learners (manikandan, jian, & kolter, 2023) - use prompted LLMs as weak learners in boosting algorithm for tabular data
    • TabRet: Pre-training Transformer-based Tabular Models for Unseen Columns (onishi…hayashi, 2023)
    • AnyPredict: A Universal Tabular Prediction System Based on Large Language Models - converting tabular data into machine-understandable prompts and fine-tuning LLMs to perform accurate predictions
  • interpretability
    • InterpreTabNet: Enhancing Interpretability of Tabular Data Using Deep Generative Models and Large Language Model (si…krishnan, 2023) - make attention sparse and describe it with GPT4
  • older

  • reviews
    • Transformers for Tabular Data Representation: A Survey of Models and Applications (badaro…papotti, 2023)
      • common data sources: Wikipedia tables for QA (e.g. 3.2M tables in this paper) or WDC web table corpus (233M tables from lehmberg et al. 2016)
      • modifications
        • positional embeddings based on rows + cols
        • attention variants: add row-wise, sparse attention allows for adding more context
      • Table Pre-training: A Survey on Model Architectures, Pretraining Objectives, and Downstream Tasks (dong et al. 2022)
      • Embeddings for Tabular Data: A Survey (singh & bedathur, 2023)
      • Deep neural networks and tabular data: A survey (borisov et al. 2022) - mostly compares performance on standard tasks (e.g. classification)
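
A small sketch of the attention pattern described for TabPFN above (each row is a token; test rows attend only to training rows). It assumes training rows also attend to all training rows and that nothing attends to test rows. `build_pfn_mask` is a hypothetical helper, not part of the TabPFN codebase.

```python
def build_pfn_mask(n_train, n_test):
    """mask[i][j] is True if row-token i may attend to row-token j.

    Tokens 0..n_train-1 are training rows; the rest are test rows.
    """
    n = n_train + n_test
    # Every token may attend only to training tokens (columns < n_train).
    return [[j < n_train for j in range(n)] for _ in range(n)]

mask = build_pfn_mask(n_train=3, n_test=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because no token attends to a test row, each test prediction is independent of the other test rows in the same forward pass.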

llm limitations / perspectives

text explanations (oldschool)

clinical papers

  • Self-Verification Improves Few-Shot Clinical Information Extraction (gero et al. 2023)
  • Large Language Models are Few-Shot Clinical Information Extractors (agrawal…sontag, 2022) - use GPT3
  • Health system-scale language models are all-purpose prediction engines (NYU 2023)
  • GPT4 in medicine book (lee, goldberg, & kohane, 2023)
    • For summaries: “Can you check the proposed note and identify any facts in it that don’t appear explicitly in the transcript?”
      • gpt often better at reviewing text than writing it
    • evaluation
      • hard to run gpt clinical trial, although can be used to identify candidates, e.g. biomarkers for followup tests
    • paperwork - replace patient intake form, medical encounter note, prior authorization note (to insurance), universal translator for health info / formatting
  • Evaluating Large Language Models on Medical Evidence Summarization (tang…peng, 2023) - score summaries based on 6 dimensions (e.g. coherence)
  • TRIALSCOPE: A Unifying Causal Framework for Scaling Real-World Evidence Generation with Biomedical Language Models (gonzalez, wong, gero, …, poon, 2023)
    • extract attributes from structured & unstructured EHR to form basis for clinical trial specification / experiments
  • Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology (wong, zhang, …, poon, 2023)
    • LLMs can structure eligibility criteria of clinical trials and extract complex matching logic (e.g., nested AND/OR/NOT)
  • BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys (gu, yang, usuyama, …, gao, poon, 2023)
    • counterfactual biomedical image generation by instruction-learning from multimodal patient journeys
    • specifically, learn from triplets (prior image, progression description, new image), where GPT-4 generates progression description based on the image notes
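
A toy sketch of the nested AND/OR/NOT matching logic mentioned in the clinical trial matching notes above. It assumes an LLM has already structured the eligibility criteria into nested tuples; the tuple format, the `HAS` operator, and the patient attributes are all hypothetical.

```python
def matches(criterion, patient_attrs):
    """Recursively evaluate a nested eligibility criterion against a set of patient attributes."""
    op, *args = criterion
    if op == "HAS":
        return args[0] in patient_attrs
    if op == "NOT":
        return not matches(args[0], patient_attrs)
    if op == "AND":
        return all(matches(a, patient_attrs) for a in args)
    if op == "OR":
        return any(matches(a, patient_attrs) for a in args)
    raise ValueError(f"unknown operator: {op}")

# Hypothetical structured criterion: cancer type AND (stage III OR IV) AND NOT prior chemo.
criterion = (
    "AND",
    ("HAS", "breast cancer"),
    ("OR", ("HAS", "stage III"), ("HAS", "stage IV")),
    ("NOT", ("HAS", "prior chemotherapy")),
)

patient_a = {"breast cancer", "stage IV"}
patient_b = {"breast cancer", "stage IV", "prior chemotherapy"}

print(matches(criterion, patient_a))  # eligible
print(matches(criterion, patient_b))  # excluded by the NOT clause
```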

evaluating with LLMs

  • G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (liu…zhu, 2023, microsoft) - ask for a score (1-5) in different categories, e.g. fluency, relevance, …
  • Human-like Summarization Evaluation with ChatGPT (gao…wan, 2023) - prompt-based scoring of different categories, facts
  • Question-answering
    • FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (min…hajishirzi, 2023) - breaks a generation into a series of facts and counts what fraction of facts are supported by a reliable knowledge source
    • PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations (li…du, 2023)
  • Machine-translation
  • General NLG
    • ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate (chan…liu, 2023)
    • AlignScore: Evaluating Factual Consistency with a Unified Alignment Function (zha…hu, 2023) - train a model to explicitly evaluate factual consistency
    • Not All Metrics Are Guilty: Improving NLG Evaluation with LLM Paraphrasing (tang…wei, 2023)
  • Classical eval
    • BERTScore, BLEURTScore
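
A minimal sketch of the FActScore-style metric noted above: the fraction of atomic facts supported by a knowledge source. It assumes the generation has already been split into atomic facts. The exact-match `is_supported` check is a toy stand-in for the retrieval + LM verification used in practice, and all names here are hypothetical.

```python
def factual_precision(atomic_facts, is_supported):
    """Fraction of atomic facts judged supported by the knowledge source."""
    if not atomic_facts:
        raise ValueError("need at least one atomic fact")
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)

# Toy knowledge source: a set of lowercase statements.
knowledge_source = {
    "marie curie was born in warsaw",
    "marie curie won two nobel prizes",
}

# Atomic facts extracted from a hypothetical generation (the last one is unsupported).
atomic_facts = [
    "Marie Curie was born in Warsaw",
    "Marie Curie won two Nobel Prizes",
    "Marie Curie was born in 1900",
]

def is_supported(fact):
    return fact.lower() in knowledge_source

score = factual_precision(atomic_facts, is_supported)
print(f"factual precision: {score:.2f}")
```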

Trained llms


  • Training Data Extraction From Pre-trained Language Models: A Survey (ishihara, 2023)
    • definitions
      • (eidetic memorization). A string s is k-eidetic memorized by an LLM f if a prompt p exists such that f(p) = s and s appears at most k times in the training set
        • slightly different definition: A string s is k-memorized with k tokens of context from LLM f if a (length-k) string p exists such that the concatenation p + s is contained in the training set, and f produces s when prompted with p by using greedy decoding
      • Differential privacy = removing any data from the training set should not considerably change trained models
      • counterfactual memorization = difference between a training example’s expected loss under a model that has and has not been trained on that example
      • some studies loosen the definition of memorization using a similarity metric for strings rather than exact string matching
  • Extracting Training Data from Large Language Models (carlini, …, raffel, 2021) - LLMs are particularly likely to memorize atypical data points
    • Quantifying Memorization Across Neural Language Models (carlini, …, zhang, 2022)
    • What does it mean for a language model to preserve privacy? (brown, …, tramer, 2022) - “privacy-preserving” LM should guarantee that a user’s data cannot ever appear (or be inferable) outside the context they originally expected it to appear in
    • Can Neural Network Memorization Be Localized? (maini, …, lipton, kolter, zhang, 2023) - memorization is often confined to a small number of neurons or channels, propose example-tied dropout to direct memorization to few neurons
  • Detecting Personal Information in Training Corpora: an Analysis (subramani, luccioni, dodge, & mitchell, 2023)
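
A toy check for the "k-memorized with context" definition in the survey notes above: p + s must appear in the training set, and greedy decoding from p must produce s. It works on characters rather than tokens for simplicity, and `toy_generate` is a hypothetical stand-in for a real model's greedy decoding.

```python
training_docs = [
    "my phone number is 555-0100",
    "the cat sat on the mat",
]

def toy_generate(prompt, max_len):
    """Stand-in for greedy decoding: parrots the first training doc containing the prompt."""
    for doc in training_docs:
        idx = doc.find(prompt)
        if idx != -1:
            start = idx + len(prompt)
            return doc[start:start + max_len]
    return ""

def is_memorized(generate, docs, prefix, suffix):
    """True if prefix + suffix occurs in the training docs and the model emits suffix given prefix."""
    in_training = any((prefix + suffix) in doc for doc in docs)
    return in_training and generate(prefix, len(suffix)) == suffix

print(is_memorized(toy_generate, training_docs, "my phone number is ", "555-0100"))
print(is_memorized(toy_generate, training_docs, "the cat sat on ", "a chair"))
```

The looser definitions mentioned above would replace the exact `==` comparison with a string-similarity threshold.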

paper parsing


  • attention = vector of importance weights
    • to predict or infer one element, such as a pixel in an image or a word in a sentence, we estimate using the attention vector how strongly it is correlated with (or “attends to”) other elements and take the sum of their values weighted by the attention vector as the approximation of the target
  • vanilla transformer: multihead attention, add + norm, position-wise ffn, add + norm
  • self-attention layer implementation, mathematics, and chandan’s self-attention cheat-sheet

mathematical overview of transformers

  • based on Formal Algorithms for Transformers
  • tasks
    • sequence modeling: learn $p(x)$, usually factorized as $p(x_i \mid x_1,…,x_{i-1})$
    • sequence-to-sequence: learn $p(z \mid x)$, e.g. translation, speech-to-text, question answering
  • preprocessing
    • embedding matrix takes in one-hot tokens and linearly maps them to a vector
    • positional embedding of a token is usually added to the token embedding to form a token’s initial embedding
  • attention types
    • Bidirectional / unmasked self-attention - primary/context vectors are the same
    • Unidirectional / masked self-attention - mask scores for words after a given word (each word attends only to itself and earlier words)
    • Cross-attention - primary/context vectors can come from different places
  • non-attention
    • layernorm: controls mean/variance of activations
      • RMSnorm: simpler version, sets mean/offset to zero
  • unembedding
    • linear layer (with softmax) that outputs size of original vocab
      • sometimes fixed to be transpose of the embedding matrix
  • predictions
    • predict next word using single linear layer on hidden state from previous word
    • finetune classification head often only using linear layer on first token from sequence
  • architectures
    • initially, encoder-decoder was common, but now models are often encoder-only (e.g. BERT) or decoder-only (e.g. GPT)
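
A bare-bones sketch of the two normalizations in the non-attention notes above, applied to a single activation vector. It omits the learned gain/bias parameters that real layers carry, and the `eps` value is an assumption.

```python
import math

def layernorm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rmsnorm(x, eps=1e-5):
    """Skip the mean subtraction; divide by the root-mean-square only."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 6.0]
ln = layernorm(x)
rn = rmsnorm(x)
print("layernorm:", [round(v, 3) for v in ln])
print("rmsnorm:  ", [round(v, 3) for v in rn])
```

The layernorm output has mean 0 and variance about 1; the rmsnorm output has mean square about 1 but keeps a nonzero mean.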

visual explanation of self-attention

  • based on article by jay alammar

  • **self-attention** - layer that lets each word learn its relation to other words
    • for each word, want score telling how much importance to place on each other word (queries $\cdot$ keys)
    • we get an encoding for each word
      • the encoding of each word returns a weighted sum of the values of the words (the current word gets the highest weight)
      • softmax the scores and use the result to take a weighted sum of the values
    • (optional) implementation details
      • multi-headed attention - just like having many filters, get many encodings for each word
        • each one can take input as the embedding from the previous attention layer
      • position vector - add this into the embedding of each word (so words know how far apart they are) - usually use sin/cos rather than actual position number
      • padding mask - add zeros to the end of the sequence
      • look-ahead mask - might want to mask to only use previous words (e.g. if our final task is decoding)
      • residual + normalize - after self-attention layer, often have residual connection to previous input, which gets added then normalized
    • decoder - each word only allowed to attend to previous positions
    • 3 components
      • queries
      • keys
      • values
  • attention
    • encoder reads input and outputs context vector after each word
    • decoder at each step uses a different weighted combination of these context vectors
      • specifically, at each step, decoder concatenates its hidden state w/ the attention vector (the weighted combination of the context vectors)
      • this is fed to a feedforward net to output a word
    • at a high level we have $Q, K, V$ and compute $\text{softmax}(QK^T)V$
      • instead could simplify it and do $\text{softmax}(XX^T)V$ - this would then be based on kernel
  • transformer
    • uses many self-attention layers
    • many stacked layers in encoder + decoder (not rnn: self-attention + feed forward)
    • details
      • initial encoding: each word -> vector
      • each layer takes a list of fixed size (hyperparameter e.g. length of longest sentence) and outputs a list of that same fixed size (so one output for each word)
        • can easily train with a masked word to predict the word at the masked position in the encoding
    • multi-headed attention has several of each of these (then just concat them)
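
A single-head sketch of the $\text{softmax}(QK^T)V$ computation above, with an optional look-ahead mask. It adds the $1/\sqrt{d_k}$ scaling from the original transformer paper (the notes omit it "at a high level"). The identity projection matrices and tiny inputs are assumptions for illustration; multi-head attention would run several of these and concatenate the outputs.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def self_attention(x, w_q, w_k, w_v, causal=False):
    """Return (outputs, attention weights) for one head; one row per word."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    weights = []
    for i, q_i in enumerate(q):
        # Score of word i against every word j: scaled dot product of query i and key j.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k]
        if causal:
            # Look-ahead mask: word i may not attend to later positions j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    # Each output is the attention-weighted sum of the value vectors.
    return matmul(weights, v), weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 words, embedding size 2
identity = [[1.0, 0.0], [0.0, 1.0]]        # hypothetical projection weights
out, weights = self_attention(x, identity, identity, identity, causal=True)
for row in weights:
    print([round(w, 3) for w in row])
```

With the causal mask on, the first word can only attend to itself, so its weight row is `[1.0, 0.0, 0.0]` and its output equals its own value vector.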

huggingface tutorial

Broadly, models can be grouped into three categories:

  • GPT-like (also called auto-regressive Transformer models)
  • BERT-like (also called auto-encoding Transformer models)
  • BART/T5-like (also called sequence-to-sequence Transformer models)
  • Handling multiple sequences - Hugging Face Course
    • pad sequences to have the same length (need to modify attention masks to ignore the padded values)
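
A plain-Python sketch of the padding step above: pad to the longest sequence and build the matching attention mask (1 = real token, 0 = padding). The token ids and `pad_id=0` are made up; Hugging Face tokenizers do the equivalent when called with `padding=True`, returning `input_ids` and `attention_mask`.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to equal length and return (input_ids, attention_mask)."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch([[5, 6, 7], [8]])
print(input_ids)       # [[5, 6, 7], [8, 0, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 0, 0]]
```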

pre-transformer nlp models

  • rnns
    • when training rnn, accumulate gradients over sequence and then update all at once
    • stacked rnns have outputs of rnns feed into another rnn
    • bidirectional rnn - one rnn left to right and another right to left (can concatenate, add, etc.)
  • standard seq2seq
    • encoder reads input and outputs context vector (the hidden state)
    • decoder (rnn) takes this context vector and generates a sequence
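
A tiny tanh-RNN sketch of the notes above: the final hidden state acts as the seq2seq context vector, and a bidirectional variant concatenates left-to-right and right-to-left states per position. The weights are fixed, hypothetical values (no training), and both directions share weights for brevity, whereas a real bidirectional rnn would use separate weights.

```python
import math

def rnn_states(inputs, w_xh, w_hh):
    """Run a tanh RNN over the inputs and return the hidden state after each step."""
    hidden_size = len(w_hh)
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w_xh[i][j] * x[j] for j in range(len(x)))
                + sum(w_hh[i][j] * h[j] for j in range(hidden_size))
            )
            for i in range(hidden_size)
        ]
        states.append(h)
    return states

def bidirectional_states(inputs, w_xh, w_hh):
    """Concatenate forward and backward hidden states at each position."""
    forward = rnn_states(inputs, w_xh, w_hh)
    backward = rnn_states(inputs[::-1], w_xh, w_hh)[::-1]
    return [f + b for f, b in zip(forward, backward)]

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 words, embedding size 2
w_xh = [[0.5, -0.3], [0.1, 0.8]]               # input-to-hidden weights
w_hh = [[0.2, 0.0], [0.0, 0.2]]                # hidden-to-hidden weights

context = rnn_states(inputs, w_xh, w_hh)[-1]   # what a seq2seq decoder would receive
bi = bidirectional_states(inputs, w_xh, w_hh)
print("context vector:", [round(v, 3) for v in context])
```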