1.1. transfer learning#
See also notes on 📌 causal inference for some close connections.
For neural-net-specific transfer see 📌 adaption/transfer.
1.1.1. overviews#
(from this paper)
1.1.2. domain adaptation algorithms#
Domain test bed available here, for generalizing to new domains (i.e. performing well on domains that differ from previously seen data)
Empirical Risk Minimization (ERM, Vapnik, 1998) - standard training
Invariant Risk Minimization (IRM, Arjovsky et al., 2019) - learns a feature representation such that the optimal linear classifier on top of that representation matches across domains.
distributionally robust optimization
instead of minimizing the average training error, minimize the maximum training error over different perturbations of the training distribution
Group Distributionally Robust Optimization (GroupDRO, Sagawa et al., 2020) - ERM + increase importance of domains with larger errors (see also papers from Sugiyama group e.g. 1, 2)
minimize the error of the worst group (a minimal sketch of the reweighting follows this list)
Variance Risk Extrapolation (VREx, Krueger et al., 2020) - encourages robustness over affine combinations of training risks, by encouraging strict equality between training risks
Interdomain Mixup (Mixup, Yan et al., 2020) - ERM on linear interpolations of examples from random pairs of domains + their labels
Marginal Transfer Learning (MTL, Blanchard et al., 2011-2020) - augment the original feature space with a feature-vector summary of each domain's marginal distribution, then treat the problem as supervised learning
Meta Learning Domain Generalization (MLDG, Li et al., 2017) - use MAML to meta-learn how to generalize across domains
learning more diverse predictors
Representation Self-Challenging (RSC, Huang et al., 2020) - adds dropout-like regularization to important features, forcing model to depend on many features
Spectral Decoupling (SD, Pezeshki et al., 2020) - regularization which forces model to learn more predictive features, even when only a few suffice
embedding prior knowledge
Style Agnostic Networks (SagNet, Nam et al., 2020) - penalize style features (assumed to be spurious)
Penalizing explanations (Rieger et al. 2020) - penalize spurious features using prior knowledge
Domain adaptation under structural causal models (chen & buhlmann, 2020)
make clearer assumptions for domain adaptation to work
introduce CIRM, which works better when both covariates and labels are perturbed in target data
kernel approach (blanchard, lee & scott, 2011) - find an appropriate RKHS and optimize a regularized empirical risk over the space
In-N-Out (xie…liang, 2020) - if we have many features, rather than using them all as inputs, use some as inputs and some as auxiliary targets to learn about the domain shift
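A minimal sketch of the GroupDRO-style reweighting mentioned above (assumptions: a standard classification setup with per-example domain/group labels; `q` is a running per-group weight vector kept by the caller across steps; Sagawa et al.'s actual algorithm adds further details such as group adjustments and regularization):

```python
import torch
import torch.nn.functional as F

def group_dro_loss(logits, labels, group_ids, q, eta=0.01):
    """Upweight domains with larger error (GroupDRO-style sketch, Sagawa et al. 2020)."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    group_losses = []
    for g in range(q.numel()):
        mask = group_ids == g
        # guard against groups that are absent from this batch
        group_losses.append(per_example[mask].mean() if mask.any()
                            else torch.zeros((), device=logits.device))
    group_losses = torch.stack(group_losses)
    # exponential-weights update: groups with higher loss get more weight
    q = q * torch.exp(eta * group_losses.detach())
    q = q / q.sum()
    # weighted sum approximately minimizes the worst-group risk
    return (q * group_losses).sum(), q
```

Each training step then calls `loss, q = group_dro_loss(model(x), y, domain_ids, q)` and backpropagates `loss`; as `eta` grows this approaches always optimizing the single worst group.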
1.1.2.1. domain invariance#
key idea: want the representation to be invariant to the domain label
same idea is used to learn fair representations, but domain label is replaced with sensitive attribute
Domain Adversarial Neural Network (DANN, Ganin et al., 2015)
Conditional Domain Adversarial Neural Network (CDANN, Li et al., 2018) - variant of DANN matching the conditional distributions across domains, for all labels
Deep CORAL (CORAL, Sun and Saenko, 2016) - match the mean / covariance of feature distributions across domains (see the sketch after this list)
Maximum Mean Discrepancy (MMD, Li et al., 2018)
adversarial discriminative domain adaptation (ADDA tzeng et al. 2017)
balancing with importance weighting
Learning Robust Representations by Projecting Superficial Statistics Out (wang et al. 2019)
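The CORAL idea above reduces to a simple penalty on feature statistics; a hedged sketch (scaling constants and whether means are matched differ between implementations):

```python
import torch

def coral_penalty(h_src, h_tgt):
    """Match the mean and covariance of features from two domains (CORAL-style)."""
    mean_src, mean_tgt = h_src.mean(0), h_tgt.mean(0)
    cent_src, cent_tgt = h_src - mean_src, h_tgt - mean_tgt
    cov_src = cent_src.T @ cent_src / (h_src.shape[0] - 1)
    cov_tgt = cent_tgt.T @ cent_tgt / (h_tgt.shape[0] - 1)
    return ((mean_src - mean_tgt) ** 2).sum() + ((cov_src - cov_tgt) ** 2).sum()
```

This penalty is added to the ERM loss with some weight, pushing the shared encoder toward domain-invariant feature statistics.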
1.1.2.2. feature learning#
https://arxiv.org/pdf/2212.07346.pdf (zhang & bottou, 2022) - during training, concatenate the representations obtained with different random seeds
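A toy illustration of that idea (everything here is a placeholder: in the paper the encoders are full networks trained from different seeds, and only a linear probe is trained on the concatenated features):

```python
import torch
import torch.nn as nn

def make_encoder(seed, in_dim=32, feat_dim=16):
    # stand-in for a network trained to convergence with a given random seed
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))

encoders = [make_encoder(seed=s) for s in range(5)]

def rich_features(x):
    # concatenate the representations obtained with different random seeds
    with torch.no_grad():
        return torch.cat([enc(x) for enc in encoders], dim=-1)

head = nn.Linear(5 * 16, 10)  # downstream linear probe on the concatenated features
print(rich_features(torch.randn(4, 32)).shape)  # torch.Size([4, 80])
```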
1.1.2.3. dynamic selection#
Dynamic Selection (DS) refers to techniques in which, for a new test point, pre-trained classifiers are selected/combined from a pool at test time; see the review paper (cruz et al. 2018) and its python package (a minimal sketch follows this list)
define region of competence
clustering
kNN - more refined than clustering
decision space - e.g. a model’s classification boundary, internal splits in a model
potential function - weight all the points (e.g. by their distance to the query point)
criteria for selection
individual scores: acc, prob. behavior, rank, meta-learning, complexity
group: data handling, ambiguity, diversity
combination
non-trainable: mean, majority vote, product, median, etc.
trainable: learn the combination of models
related: in mixture-of-experts, the models + the combination are trained jointly
dynamic weighting: combine using local competence of base classifiers
Oracle baseline - selects a classifier that predicts the correct label, if such a classifier exists
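A minimal dynamic-selection sketch (kNN region of competence + local-accuracy selection; `classifiers` is any list of fitted scikit-learn-style estimators, and real DS methods add the selection criteria and combination rules listed above):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ds_predict(x_query, classifiers, X_val, y_val, k=7):
    """Pick the classifier with the best local accuracy around the query point."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_val)
    _, idx = nn.kneighbors(x_query.reshape(1, -1))
    region_X, region_y = X_val[idx[0]], y_val[idx[0]]  # region of competence
    local_acc = [np.mean(clf.predict(region_X) == region_y) for clf in classifiers]
    best = int(np.argmax(local_acc))
    return classifiers[best].predict(x_query.reshape(1, -1))[0]
```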
1.1.2.4. test-time adaptation#
test-time adaptation
test-time augmentation
batch normalization (AdaBN)
label shift estimation (BBSE) - \(p(y)\) shifts but \(p(x|y)\) does not
entropy minimization (test-time entropy minimization, TENT, wang et al. 2020) - optimize for model confidence (entropy of predictions), updating only normalization statistics and channel-wise affine transformations (see the sketch after this list)
Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models (shu…anandkumar, xiao, 2022) - optimize prompt to minimize the entropy with confidence selection so that the model has consistent predictions across different augmented views of each test sample
test-time learning with rotation prediction (sun et al. 2020) - at test-time, update parameters for self-supervised rotation prediction task then use for classification
masked autoencoders (gandelsman, sun, …, efros, 2022) - use masked-autoencoder reconstruction as the test-time self-supervised task, improving performance on robustness tasks
test-time learning for Reading Comprehension (banerjee et al. 2021) - uses self-supervision to train models on synthetically generated question-answer pairs, and then infers answers to unseen human-authored questions for this context
TTT++: When Does Self-Supervised Test-Time Training Fail or Thrive? (liu et al. 2021) - explore different test-time adaptation methods and combine Test-time feature alignment with Test-time contrastive learning
combining train-time and test-time adaptation
Adaptive Risk Minimization (ARM, Zhang et al., 2020) - combines groups at training time + batches at test-time
meta-train the model on simulated distribution shifts (enabled by the training groups) so that it exhibits strong post-adaptation performance on each shift
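A sketch of TENT-style test-time entropy minimization (assumes a BatchNorm-based classifier; the exact parameter selection and optimizer settings follow Wang et al. only loosely):

```python
import torch
import torch.nn as nn

def configure_tent(model):
    """Freeze everything except the channel-wise BN affine parameters; use test-batch statistics."""
    model.eval()
    model.requires_grad_(False)
    params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)) and m.affine:
            m.train()                      # normalize with the current test batch's statistics
            m.weight.requires_grad_(True)
            m.bias.requires_grad_(True)
            params += [m.weight, m.bias]
    return params

def tent_step(model, x, optimizer):
    """One adaptation step: minimize the entropy of predictions on an unlabeled test batch."""
    probs = model(x).softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()
```

Usage would be roughly `opt = torch.optim.Adam(configure_tent(model), lr=1e-3)` followed by `tent_step(model, x_test, opt)` on each incoming batch.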
1.1.3. adv attacks#
Adversarial Attacks and Defenses in Images, Graphs and Text: A Review (xu et al. 2019)
attacks
fast gradient sign method (FGSM) - add the sign of the loss gradient to the input to maximize the loss, limiting the amplitude of each pixel's channel to stay imperceptible; iterating this step gives PGD/BIM-style attacks (see the sketch after this list)
Barrage of Random Transforms for Adversarially Robust Defense (raff et al. 2019)
DeepFool: a simple and accurate method to fool deep neural networks (Moosavi-Dezfooli et. al 2016)
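A minimal FGSM sketch (assumes inputs scaled to [0, 1]; iterating the step with a smaller step size and projection gives the BIM/PGD variants):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8 / 255):
    """One step in the direction of the sign of the input gradient, bounded by eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # maximize the loss under an L_inf budget
    return x_adv.clamp(0, 1).detach()      # keep pixels in a valid range
```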
defenses
Adversarial training - training data is augmented with adv examples (Szegedy et al., 2014b; Madry et al., 2017; Tramer et al., 2017; Yu et al., 2019); a training-step sketch follows this list
- \[\min _{\boldsymbol{\theta}} \frac{1}{N} \sum_{n=1}^{N}\left[\operatorname{Loss}\left(f_{\theta}\left(x_{n}\right), y_{n}\right)+\lambda \max _{\|\delta\|_{\infty} \leq \epsilon} \operatorname{Loss}\left(f_{\theta}\left(x_{n}+\delta\right), y_{n}\right)\right]\]
this perspective differs from “robust statistics” which is usually robustness against some kind of model misspecification/assumptions, not to distr. shift
robust stat usually assumes a generative distr. as well
still often ends up with the same solution (e.g. ridge regression corresponds to a certain kind of robustness)
Stochasticity: certain inputs or hidden activations are shuffled or randomized (Xie et al., 2017; Prakash et al., 2018; Dhillon et al., 2018)
Preprocessing: inputs or hidden activations are quantized, projected into a different representation or are otherwise preprocessed (Guo et al., 2017; Buckman et al., 2018; Kabilan et al., 2018)
Manifold projections: an input sample is projected in a lower dimensional space in which the neural network has been trained to be particularly robust (Ilyas et al., 2017; Lamb et al., 2018)
Regularization in the loss function: an additional penalty term is added to the optimized objective function to upper bound or to approximate the adversarial loss (Hein and Andriushchenko, 2017; Yan et al., 2018)
constraint
robustness as a constraint not a loss (Constrained Learning with Non-Convex Losses (chamon et al. 2021))
- \[\begin{split}\begin{aligned} \min _{\boldsymbol{\theta}} & \frac{1}{N} \sum_{n=1}^{N} \operatorname{Loss}\left(f_{\theta}\left(x_{n}\right), y_{n}\right) \\ \text { subject to } & \frac{1}{N} \sum_{n=1}^{N}\left[\max _{\|\delta\|_{\infty} \leq \epsilon} \operatorname{Loss}\left(f_{\theta}\left(\boldsymbol{x}_{n}+\delta\right), y_{n}\right)\right] \leq c \end{aligned}\end{split}\]
when the loss is convex, the penalized and constrained problems are equivalent (for appropriately matched λ and c); with non-convex losses they need not coincide
a possible defense against adversarial attacks is to solve the anticausal classification problem by modeling the causal generative direction, a method which in vision is referred to as analysis by synthesis (Schott et al., 2019)
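A sketch of the penalized adversarial-training objective from the formula above, approximating the inner maximization with the `fgsm_attack` sketch given earlier (Madry et al. use multi-step PGD instead):

```python
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer, lam=1.0, eps=8 / 255):
    """Clean loss + lam * adversarial loss, with the inner max approximated by one FGSM step."""
    x_adv = fgsm_attack(model, x, y, eps)   # approximate max over ||delta||_inf <= eps
    loss = F.cross_entropy(model(x), y) + lam * F.cross_entropy(model(x_adv), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```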
robustness vs accuracy
robustness may be at odds with accuracy (tsipras…madry, 2019)
Precise Tradeoffs in Adversarial Training for Linear Regression (javanmard et al. 2020) - linear regression with gaussian features
use adv. training formula above
Theoretically Principled Trade-off between Robustness and Accuracy (Zhang, …, el ghaoui, Jordan, 2019)
adversarial examples
Decision Boundary Analysis of Adversarial Examples (He, Li, & Song 2019)
Natural Adversarial Examples (Hendrycks, Zhao, Basart, Steinhardt, & Song 2020)
ImageNet-Trained CNNs Are Biased Towards Texture (Geirhos et al. 2019)
adversarial transferability
Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples (papernot, mcdaniel, & goodfellow, 2016)
Ensemble Adversarial Training: Attacks and Defenses (tramer et al. 2018)
Improving Adversarial Robustness via Promoting Ensemble Diversity (pang et al. 2019)
encourage diversity in non-maximal predictions
robustness
smoothness yields robustness (but can be robust without smoothness)
margin idea - data points close to the boundary are not robust
we want our boundary to go through regions where data is scarce
1.1.3.1. nlp#
QData/TextAttack: TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP (https://textattack.readthedocs.io/en/master/). The table below lists only the attacks on classification tasks (e.g. sentiment classification and entailment); a minimal usage sketch follows the table.
| Attack Recipe Name | Goal Function | Constraints Enforced | Transformation | Search Method | Main Idea |
|---|---|---|---|---|---|
| a2t | Untargeted {Classification, Entailment} | Percentage of words perturbed, Word embedding distance, DistilBERT sentence encoding cosine similarity, part-of-speech consistency | Counter-fitted word embedding swap (or) BERT Masked Token Prediction | Greedy-WIR (gradient) | from "Towards Improving Adversarial Training of NLP Models" (Yoo et al., 2021) |
| alzantot | Untargeted {Classification, Entailment} | Percentage of words perturbed, Language Model perplexity, Word embedding distance | Counter-fitted word embedding swap | Genetic Algorithm | from "Generating Natural Language Adversarial Examples" (Alzantot et al., 2018) |
| bae | Untargeted Classification | USE sentence encoding cosine similarity | BERT Masked Token Prediction | Greedy-WIR | BERT masked language model transformation attack from "BAE: BERT-based Adversarial Examples for Text Classification" (Garg & Ramakrishnan, 2019) |
| bert-attack | Untargeted Classification | USE sentence encoding cosine similarity, Maximum number of words perturbed | BERT Masked Token Prediction (with subword expansion) | Greedy-WIR | "BERT-ATTACK: Adversarial Attack Against BERT Using BERT" (Li et al., 2020) |
| checklist | {Untargeted, Targeted} Classification | checklist distance | contract, extend, and substitute named entities | Greedy-WIR | Invariance testing implemented in CheckList ("Beyond Accuracy: Behavioral Testing of NLP models with CheckList" (Ribeiro et al., 2020)) |
| clare | Untargeted {Classification, Entailment} | USE sentence encoding cosine similarity | RoBERTa Masked Prediction for token swap, insert and merge | Greedy | ["Contextualized Perturbation for Textual Adversarial Attack" (Li et al., 2020)](https://arxiv.org/abs/2009.07502) |
| deepwordbug | {Untargeted, Targeted} Classification | Levenshtein edit distance | {Character Insertion, Character Deletion, Neighboring Character Swap, Character Substitution} | Greedy-WIR | Greedy replace-1 scoring and multi-transformation character-swap attack (["Black-box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers" (Gao et al., 2018)](https://arxiv.org/abs/1801.04354)) |
| fast-alzantot | Untargeted {Classification, Entailment} | Percentage of words perturbed, Language Model perplexity, Word embedding distance | Counter-fitted word embedding swap | Genetic Algorithm | Modified, faster version of the Alzantot et al. genetic algorithm, from (["Certified Robustness to Adversarial Word Substitutions" (Jia et al., 2019)](https://arxiv.org/abs/1909.00986)) |
| hotflip (word swap) | Untargeted Classification | Word Embedding Cosine Similarity, Part-of-speech match, Number of words perturbed | Gradient-Based Word Swap | Beam search | (["HotFlip: White-Box Adversarial Examples for Text Classification" (Ebrahimi et al., 2017)](https://arxiv.org/abs/1712.06751)) |
| iga | Untargeted {Classification, Entailment} | Percentage of words perturbed, Word embedding distance | Counter-fitted word embedding swap | Genetic Algorithm | Improved genetic algorithm-based word substitution from (["Natural Language Adversarial Attacks and Defenses in Word Level" (Wang et al., 2019)](https://arxiv.org/abs/1909.06723)) |
| input-reduction | Input Reduction | | Word deletion | Greedy-WIR | Greedy attack with word importance ranking; reduces the input while maintaining the prediction (["Pathologies of Neural Models Make Interpretation Difficult" (Feng et al., 2018)](https://arxiv.org/pdf/1804.07781.pdf)) |
| kuleshov | Untargeted Classification | Thought vector encoding cosine similarity, Language model similarity probability | Counter-fitted word embedding swap | Greedy word swap | (["Adversarial Examples for Natural Language Classification Problems" (Kuleshov et al., 2018)](https://openreview.net/pdf?id=r1QZ3zbAZ)) |
| pruthi | Untargeted Classification | Minimum word length, Maximum number of words perturbed | {Neighboring Character Swap, Character Deletion, Character Insertion, Keyboard-Based Character Swap} | Greedy search | simulates common typos (["Combating Adversarial Misspellings with Robust Word Recognition" (Pruthi et al., 2019)](https://arxiv.org/abs/1905.11268)) |
| pso | Untargeted Classification | | HowNet Word Swap | Particle Swarm Optimization | (["Word-level Textual Adversarial Attacking as Combinatorial Optimization" (Zang et al., 2020)](https://www.aclweb.org/anthology/2020.acl-main.540/)) |
| pwws | Untargeted Classification | | WordNet-based synonym swap | Greedy-WIR (saliency) | Greedy attack with word importance ranking based on word saliency and synonym swap scores (["Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency" (Ren et al., 2019)](https://www.aclweb.org/anthology/P19-1103/)) |
| textbugger (black-box) | Untargeted Classification | USE sentence encoding cosine similarity | {Character Insertion, Character Deletion, Neighboring Character Swap, Character Substitution} | Greedy-WIR | (["TextBugger: Generating Adversarial Text Against Real-world Applications" (Li et al., 2018)](https://arxiv.org/abs/1812.05271)) |
| textfooler | Untargeted {Classification, Entailment} | Word Embedding Distance, Part-of-speech match, USE sentence encoding cosine similarity | Counter-fitted word embedding swap | Greedy-WIR | Greedy attack with word importance ranking (["Is BERT Really Robust?" (Jin et al., 2019)](https://arxiv.org/abs/1907.11932)) |
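A minimal usage sketch for running one of the recipes in the table (based on the TextAttack docs; the class names, pretrained checkpoint, and dataset arguments are assumptions that may differ across library versions):

```python
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# wrap a sentiment classifier so TextAttack can query it (checkpoint name is an assumption)
name = "textattack/bert-base-uncased-SST-2"
model = transformers.AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
wrapper = HuggingFaceModelWrapper(model, tokenizer)

attack = TextFoolerJin2019.build(wrapper)             # the `textfooler` recipe from the table
dataset = HuggingFaceDataset("glue", "sst2", split="validation")
Attacker(attack, dataset, AttackArgs(num_examples=10)).attack_dataset()
```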