transfer learning view markdown
See also notes on đź“Ś causal inference for some close connections.
For neuralnet specific transferring see đź“Ś adaption/transfer.
overviews
(from this paper)
domain adaptation algorithms
Domain test bed available here, for generalizating to new domains (i.e. performing well on domains that differ from previous seen data)
 Empirical Risk Minimization (ERM, Vapnik, 1998)  standard training
 Invariant Risk Minimization (IRM, Arjovsky et al., 2019)  learns a feature representation such that the optimal linear classifier on top of that representation matches across domains.
 distributional robust optimization
 instead of minimizing training err, minimize maximum training err over different perturbations
 Group Distributionally Robust Optimization (GroupDRO, Sagawa et al., 2020)  ERM + increase importance of domains with larger errors (see also papers from Sugiyama group e.g. 1, 2)
 minimize error for worst group
 Variance Risk Extrapolation (VREx, Krueger et al., 2020)  encourages robustness over affine combinations of training risks, by encouraging strict equality between training risks
 Interdomain Mixup (Mixup, Yan et al., 2020)  ERM on linear interpolations of examples from random pairs of domains + their labels
 Marginal Transfer Learning (MTL, Blanchard et al., 20112020)  augment original feature space with feature vector marginal distributions and then treat as a supervised learning problem
 MAML (finn, abbeel, & levine, 2017)  minimize parameters for metalearning including finetuning as part of the process (intuitively, find parameters that improve performance on a task after finetuning on that task)
 $\min \theta \underbrace{\mathbb{E}\tau }{\text{average over tasks } \tau}\left[\mathcal{L}\tau\left(\underbrace{U_\tau(\theta)}_{\text{finetuned model}}\right)\right]$â€‹
 compute finetuned models then take gradient wrt to heldout samples from the same tasks
 Meta Learning Domain Generalization (MLDG, Li et al., 2017)  use MAML to metalearn how to generalize across domains
 $\min \theta \underbrace{\mathbb{E}\tau }{\text{average over tasks } \tau}\left[\mathcal{L}\tau\left(\underbrace{U_\tau(\theta)}_{\text{finetuned model}}\right)\right]$â€‹
 learning more diverse predictors
 Representation SelfChallenging (RSC, Huang et al., 2020)  adds dropoutlike regularization to important features, forcing model to depend on many features
 Spectral Decoupling (SD, Pezeshki et al., 2020)  regularization which forces model to learn more predictive features, even when only a few suffice
 embedding prior knowledge
 Style Agnostic Networks (SagNet, Nam et al., 2020)  penalize style features (assumed to be spurious)
 Penalizing explanations (Rieger et al. 2020)  penalize spurious features using prior knowledge
 Domain adaptation under structural causal models (chen & buhlmann, 2020)
 make clearer assumptions for domain adaptation to work
 introduce CIRM, which works better when both covariates and labels are perturbed in target data
 kernel approach (blanchard, lee & scott, 2011)  find an appropriate RKHS and optimize a regularized empirical risk over the space
 InNOut (xieâ€¦lang, 2020)  if we have many features, rather than using them all as features, can use some as features and some as targets when we shift, to learn the domain shift
domain invariance
key idea: want repr. to be invariant to domain label
 same idea is used to learn fair representations, but domain label is replaced with sensitive attribute
 Domain Adversarial Neural Network (DANN, Ganin et al., 2015)
 Conditional Domain Adversarial Neural Network (CDANN, Li et al., 2018)  variant of DANN matching the conditional distributions across domains, for all labels
 Deep CORAL (CORAL, Sun and Saenko, 2016)  match mean / covariance of feature distrs
 Maximum Mean Discrepancy (MMD, Li et al., 2018)
 adversarial discriminative domain adaptation (ADDA tzeng et al. 2017)
 balancing with importance weighting
 Learning Robust Representations by Projecting Superficial Statistics Out (wang et al. 2019)
feature learning
 https://arxiv.org/pdf/2212.07346.pdf (zhang & bottou, 2022)  during training, concatenate the representations obtained with different random seeds
dynamic selection
Dynamic Selection (DS) refers to techniques in which, for a new test point, pretrained classifiers are selected/combined from a pool at test time review paper (cruz et al. 2018), python package
 define region of competence
 clustering
 kNN  more refined than clustering
 decision space  e.g. a modelâ€™s classification boundary, internal splits in a model
 potential function  weight all the points (e.g. by their distance to the query point)
 criteria for selection
 individual scores: acc, prob. behavior, rank, metalearning, complexity
 group: data handling, ambiguity, diversity
 combination
 nontrainable: mean, majority vote, product, median, etc.
 trainable: learn the combination of models
 related: in mixture of experts models + combination are trained jointly
 dynamic weighting: combine using local competence of base classifiers
 Oracle baseline  selects classifier predicts correct label, if such a classifier exists
testtime adaptation
 testtime adaptation
 testtime augmentation
 batch normalization (AdaBN, liâ€¦hou, 2016)

label shift estimation (BBSE, lipton, wang, & smola, 2018)  $p(y)$ shifts but $P(x y)$ does not  entropy minimization (testtime entropy minimization, TENT, wangâ€¦olshausen, darrel, 2020)  optimize for model confidence (entropy of predictions), using only norm. statistics and channelwise affine transformations
 TestTime Prompt Tuning for ZeroShot Generalization in VisionLanguage Models (shuâ€¦anandkumar, xiao, 2022)  optimize prompt to minimize the entropy with confidence selection so that the model has consistent predictions across different augmented views of each test sample
 combining traintime and testtime adaptation: Adaptive Risk Minimization (ARM, zhang et al., 2020)  combines groups at training time + batches at testtime
 metatrain the model using simulated distribution shifts, which is enabled by the training groups, such that it exhibits strong postadaptation performance on each shift
 TTT: TestTime Training with SelfSupervision for Generalization under Distribution Shifts (sunâ€¦efros, hardt. 2020)  at testtime, update parameters for selfsupervised rotation prediction task then use for classification
 TTT with masked autoencoders (gandelsman, sun, â€¦, efros, 2022)  use reconstructed with masked autoencoder and improve performance on robustness tasks
 TestTime Training on Video Streams (wang, sun, â€¦, efros, wang, 2023)  use masked autoencoder perframe and incrementally update across each frame
 MTTT: Learning to (Learn at Test Time) (sun, li, dalal, â€¦guestrin, 2024)
 analysis
 TTT for Reading Comprehension (banerjee et al. 2021)  uses selfsupervision to train models on synthetically generated questionanswer pairs, and then infers answers to unseen humanauthored questions for this context
 TTT++: When Does SelfSupervised TestTime Training Fail or Thrive? (liu et al. 2021)  explore different testtime adaptation methods and combine Testtime feature alignment with Testtime contrastive learning
 Never Train from Scratch: Fair Comparison of LongSequence Models Requires DataDriven Priors (amos, berant, & gupta, 2024)  when transferring to a specific task, should first finetune the LM on the task before doing supervised finetuning
adv attacks
 Adversarial Attacks and Defenses in Images, Graphs and Text: A Review (xu et al. 2019)
 attacks
 fast gradient step method  keep adding gradient to maximize noise (limit amplitude of pixelâ€™s channel to stay imperceptible)
 Barrage of Random Transforms for Adversarially Robust Defense (raff et al. 2019)
 DeepFool: a simple and accurate method to fool deep neural networks (MoosaviDezfooli et. al 2016)
 defenses
 Adversarial training  training data is augmented with adv examples (Szegedy et al., 2014b; Madry et al., 2017; Tramer et al., 2017; Yu et al., 2019)
 \[\min _{\boldsymbol{\theta}} \frac{1}{N} \sum_{n=1}^{N} \operatorname{Loss}\left(f_{\theta}\left(x_{n}\right), y_{n}\right)+\lambda\left[\max _{\\delta\_{\infty} \leq \epsilon} \operatorname{Loss}\left(f_{\theta}\left(x_{n}+\delta\right), y_{n}\right)\right]\]
 this perspective differs from â€śrobust statisticsâ€ť which is usually robustness against some kind of model misspecification/assumptions, not to distr. shift
 robust stat usually assumes a generative distr. as well
 still often ends up with the same soln (e.g. ridge regr. corresponds to certain robusteness)
 Stochasticity: certain inputs or hidden activations are shuffled or randomized (Xie et al., 2017; Prakash et al., 2018; Dhillon et al., 2018)
 Preprocessing: inputs or hidden activations are quantized, projected into a different representation or are otherwise preprocessed (Guo et al., 2017; Buckman et al., 2018; Kabilan et al., 2018)
 Manifold projections: an input sample is projected in a lower dimensional space in which the neural network has been trained to be particularly robust (Ilyas et al., 2017; Lamb et al., 2018)
 Regularization in the loss function: an additional penalty term is added to the optimized objective function to upper bound or to approximate the adversarial loss (Hein and Andriushchenko, 2017; Yan et al., 2018)
 constraint
 robustness as a constraint not a loss (Constrained Learning with NonConvex Losses (chamon et al. 2021))
 \[\begin{aligned} \min _{\boldsymbol{\theta}} & \frac{1}{N} \sum_{n=1}^{N} \operatorname{Loss}\left(f_{\theta}\left(x_{n}\right), y_{n}\right) \\ \text { subject to } & \frac{1}{N} \sum_{n=1}^{N}\left[\max _{\\delta\_{\infty} \leq \epsilon} \operatorname{Loss}\left(f_{\theta}\left(\boldsymbol{x}_{n}+\delta\right), y_{n}\right)\right] \leq c \end{aligned}\]
 when penalty is convex, these 2 problems are the same
 robustness as a constraint not a loss (Constrained Learning with NonConvex Losses (chamon et al. 2021))
 a possible defense against adversarial attacks is to solve the anticausal classification problem by modeling the causal generative direction, a method which in vision is referred to as analysis by synthesis (Schott et al., 2019)
 Adversarial training  training data is augmented with adv examples (Szegedy et al., 2014b; Madry et al., 2017; Tramer et al., 2017; Yu et al., 2019)
 robustness vs accuracy
 robustness may be at odds with accuracy (tsipirasâ€¦madry, 2019)
 Precise Tradeoffs in Adversarial Training for Linear Regression (javanmard et al. 2020)  linear regression with gaussian features
 use adv. training formula above
 Theoretically Principled Tradeoff between Robustness and Accuracy (Zhang, â€¦, el ghaoui, Jordan, 2019)
 adversarial examples
 Decision Boundary Analysis of Adversarial Examples (He, Li, & Song 2019)
 Natural Adversarial Examples (Hendrycks, Zhao, Basart, Steinhardt, & Song 2020)
 ImageNetTrained CNNs Are Biased Towards Texture (Geirhos et al. 2019)
 adversarial transferability
 Transferability in Machine Learning: from Phenomena to BlackBox Attacks using Adversarial Samples (papernot, mcdaniel, & goodfellow, 2016)
 Ensemble Adversarial Training: Attacks and Defenses (tramer et al. 2018)
 Improving Adversarial Robustness via Promoting Ensemble Diversity (pang et al. 2019)
 encourage diversity in nonmaximal predictions
 robustness
 smoothness yields robustness (but can be robust without smoothness)
 margin idea  data points close to the boundary are not robust
 we want our boundary to go through regions where data is scarce