transfer learning

See also notes on 📌 causal inference for some close connections.

For neural-net-specific transfer, see 📌 adaption/transfer.


transfer_taxonomy (from this paper)

domain adaptation algorithms

A domain test bed is available here, for generalizing to new domains (i.e. performing well on domains that differ from previously seen data)

  • Empirical Risk Minimization (ERM, Vapnik, 1998) - standard training
  • Invariant Risk Minimization (IRM, Arjovsky et al., 2019) - learns a feature representation such that the optimal linear classifier on top of that representation matches across domains.
  • distributionally robust optimization
    • instead of minimizing training error, minimize the maximum training error over different perturbations
    • Group Distributionally Robust Optimization (GroupDRO, Sagawa et al., 2020) - ERM + increase importance of domains with larger errors (see also papers from Sugiyama group e.g. 1, 2)
      • minimize error for worst group
    • Variance Risk Extrapolation (VREx, Krueger et al., 2020) - encourages robustness over affine combinations of training risks, by encouraging strict equality between training risks
  • Interdomain Mixup (Mixup, Yan et al., 2020) - ERM on linear interpolations of examples from random pairs of domains + their labels
  • Marginal Transfer Learning (MTL, Blanchard et al., 2011-2020) - augment original feature space with feature vector marginal distributions and then treat as a supervised learning problem
  • Meta Learning Domain Generalization (MLDG, Li et al., 2017) - use MAML to meta-learn how to generalize across domains
  • MAML (finn, abbeel, & levine, 2017) - optimize parameters for meta-learning, including finetuning as part of the process (intuitively, find parameters that improve performance on a task after finetuning on that task)
    • $\min_\theta \underbrace{\mathbb{E}_\tau}_{\text{average over tasks } \tau}\left[\mathcal{L}_\tau\left(\underbrace{U_\tau(\theta)}_{\text{finetuned model}}\right)\right]$
      • compute finetuned models, then take the gradient wrt held-out samples from the same tasks
  • learning more diverse predictors
    • Representation Self-Challenging (RSC, Huang et al., 2020) - adds dropout-like regularization to important features, forcing model to depend on many features
    • Spectral Decoupling (SD, Pezeshki et al., 2020) - regularization which forces model to learn more predictive features, even when only a few suffice
  • embedding prior knowledge
    • Style Agnostic Networks (SagNet, Nam et al., 2020) - penalize style features (assumed to be spurious)
    • Penalizing explanations (Rieger et al. 2020) - penalize spurious features using prior knowledge
  • Domain adaptation under structural causal models (chen & buhlmann, 2020)
    • make clearer assumptions for domain adaptation to work
    • introduce CIRM, which works better when both covariates and labels are perturbed in target data
  • kernel approach (blanchard, lee & scott, 2011) - find an appropriate RKHS and optimize a regularized empirical risk over the space
  • In-N-Out (xie…liang, 2020) - if we have many features, rather than using them all as inputs, we can use some as inputs and some as prediction targets, to learn the domain shift
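Several of the objectives above differ only in how they aggregate per-domain risks. A minimal sketch of that difference, assuming the per-domain risks are already computed; the function names, the `step_size` and `beta` parameters, and the example numbers are all made up for illustration, and the multiplicative weight update is one reading of "increase importance of domains with larger errors" rather than a reference implementation:

```python
import math

def erm_objective(risks):
    """ERM-style pooling: plain average of per-domain risks (equal-sized domains assumed)."""
    return sum(risks) / len(risks)

def worst_group_objective(risks):
    """Worst-group objective: only the domain with the largest risk counts."""
    return max(risks)

def group_dro_weights(weights, risks, step_size=1.0):
    """Multiplicative update that up-weights domains with larger risk, then renormalizes."""
    scaled = [w * math.exp(step_size * r) for w, r in zip(weights, risks)]
    total = sum(scaled)
    return [s / total for s in scaled]

def vrex_objective(risks, beta=10.0):
    """Mean risk plus a penalty on the variance of risks across domains."""
    mean = sum(risks) / len(risks)
    variance = sum((r - mean) ** 2 for r in risks) / len(risks)
    return mean + beta * variance

domain_risks = [0.2, 0.5, 0.8]  # hypothetical risks for three training domains
weights = group_dro_weights([1 / 3, 1 / 3, 1 / 3], domain_risks)
weighted_risk = sum(w * r for w, r in zip(weights, domain_risks))
```

With equal risks the variance penalty vanishes and `vrex_objective` reduces to the ERM average; as `step_size` grows, the reweighted risk approaches the worst-group risk.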

domain invariance

key idea: want repr. to be invariant to domain label
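One crude way to check this is to compare feature statistics across domains. The sketch below uses the squared distance between per-domain feature means as a stand-in invariance penalty; this is an assumed proxy chosen for illustration (matching means is necessary but not sufficient for invariance), and the function names and toy features are hypothetical:

```python
def feature_means(features):
    """Per-dimension mean of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(row[j] for row in features) / n for j in range(dim)]

def mean_discrepancy(features_a, features_b):
    """Squared Euclidean distance between the mean features of two domains."""
    mean_a = feature_means(features_a)
    mean_b = feature_means(features_b)
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))

domain_a = [[0.0, 1.0], [2.0, 3.0]]   # toy representations from domain A
domain_b = [[1.0, 2.0], [1.0, 2.0]]   # same mean as A, so the penalty is zero
domain_c = [[3.0, 2.0], [3.0, 2.0]]   # first feature is shifted, so the penalty is positive

penalty_ab = mean_discrepancy(domain_a, domain_b)
penalty_ac = mean_discrepancy(domain_a, domain_c)
```

In training, a penalty like this would be added to the task loss so that the representation is pushed toward carrying less information about the domain label.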

feature learning

  • (zhang & bottou, 2022) - during training, concatenate the representations obtained with different random seeds
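A rough sketch of the concatenation step only. Here `make_featurizer` is a hypothetical stand-in for a network trained with a given seed (a seeded random linear map followed by a ReLU); no training happens in this example, and none of the names come from the paper:

```python
import random

def make_featurizer(seed, in_dim, out_dim):
    """Stand-in for a seed-specific trained network: seeded random linear map + ReLU."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def featurize(x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

    return featurize

def concatenated_representation(featurizers, x):
    """Concatenate the representations produced by every featurizer for input x."""
    combined = []
    for featurize in featurizers:
        combined.extend(featurize(x))
    return combined

featurizers = [make_featurizer(seed, in_dim=3, out_dim=4) for seed in (0, 1, 2)]
x = [1.0, -0.5, 2.0]
representation = concatenated_representation(featurizers, x)  # length 3 * 4 = 12
```

A downstream classifier would then be fit on `representation` instead of on the output of any single seed.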

dynamic selection

Dynamic Selection (DS) refers to techniques in which, for a new test point, pre-trained classifiers are selected/combined from a pool at test time. See the review paper (cruz et al. 2018) and the python package.

  1. define region of competence
    1. clustering
    2. kNN - more refined than clustering
    3. decision space - e.g. a model’s classification boundary, internal splits in a model
    4. potential function - weight all the points (e.g. by their distance to the query point)
  2. criteria for selection
    1. individual scores: acc, prob. behavior, rank, meta-learning, complexity
    2. group: data handling, ambiguity, diversity
  3. combination
    1. non-trainable: mean, majority vote, product, median, etc.
    2. trainable: learn the combination of models
      1. related: in mixture-of-experts, the models + their combination are trained jointly
    3. dynamic weighting: combine using local competence of base classifiers
    4. Oracle baseline - selects a classifier that predicts the correct label, if such a classifier exists
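The three steps above can be sketched on a toy problem: a kNN region of competence, local accuracy as the selection criterion, and a majority vote over the selected classifiers, plus the oracle baseline. Everything here (function names, the two threshold classifiers, the 1-D validation set) is a hypothetical illustration and not the API of the python package mentioned above:

```python
from collections import Counter

def region_of_competence(query, validation_points, k):
    """Indices of the k validation points closest to the query (squared Euclidean)."""
    def distance(i):
        return sum((a - b) ** 2 for a, b in zip(validation_points[i], query))
    return sorted(range(len(validation_points)), key=distance)[:k]

def local_accuracy(classifier, region, validation_points, validation_labels):
    """Fraction of points in the region that the classifier labels correctly."""
    correct = sum(classifier(validation_points[i]) == validation_labels[i] for i in region)
    return correct / len(region)

def select_and_predict(query, pool, validation_points, validation_labels, k=3):
    """Keep the locally most accurate classifiers, then combine them by majority vote."""
    region = region_of_competence(query, validation_points, k)
    scores = [local_accuracy(c, region, validation_points, validation_labels) for c in pool]
    best = max(scores)
    selected = [c for c, s in zip(pool, scores) if s == best]
    votes = Counter(c(query) for c in selected)
    return votes.most_common(1)[0][0]

def oracle_predict(query, true_label, pool):
    """Oracle baseline: correct whenever any classifier in the pool is correct."""
    for classifier in pool:
        if classifier(query) == true_label:
            return true_label
    return pool[0](query)

# Toy pool: each classifier is only reliable on one side of the input space.
def clf_left(x):
    return 1 if x[0] < -1.5 else 0

def clf_right(x):
    return 1 if x[0] > 1.5 else 0

pool = [clf_left, clf_right]
validation_points = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
validation_labels = [1, 1, 0, 0, 1, 1]

prediction_right = select_and_predict([2.5], pool, validation_points, validation_labels)
prediction_left = select_and_predict([-2.5], pool, validation_points, validation_labels)
```

Near `2.5` only `clf_right` is locally accurate, so it alone is selected; near `-2.5` the selection flips to `clf_left`. Neither classifier is accurate over the whole validation set.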

test-time adaptation

  • test-time adaptation
  • test-time learning with rotation prediction (sun et al. 2020) - at test-time, update parameters for self-supervised rotation prediction task then use for classification
    • masked autoencoders (gandelsman, sun, …, efros, 2022) - use reconstruction with a masked autoencoder as the test-time task and improve performance on robustness tasks
    • test-time learning for Reading Comprehension (banerjee et al. 2021) - uses self-supervision to train models on synthetically generated question-answer pairs, and then infers answers to unseen human-authored questions for this context
    • TTT++: When Does Self-Supervised Test-Time Training Fail or Thrive? (liu et al. 2021) - explore different test-time adaptation methods and combine Test-time feature alignment with Test-time contrastive learning
  • combining train-time and test-time adaptation
    • Adaptive Risk Minimization (ARM, Zhang et al., 2020) - combines groups at training time + batches at test-time
      • meta-train the model using simulated distribution shifts, which is enabled by the training groups, such that it exhibits strong post-adaptation performance on each shift
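To make "adapt on an unlabeled test batch" concrete, here is a deliberately simplified sketch: the classifier is fixed, and the only thing adapted at test time is the input normalization, re-estimated from the test batch. This is a swapped-in simplification, not the self-supervised update of sun et al. or the meta-trained procedure of ARM; the function names and toy data are assumptions:

```python
def batch_statistics(batch):
    """Mean and standard deviation of a batch of scalar features."""
    n = len(batch)
    mean = sum(batch) / n
    variance = sum((x - mean) ** 2 for x in batch) / n
    return mean, variance ** 0.5

def predict(batch, mean, std):
    """Fixed classifier: class 1 if the normalized feature is positive, else class 0."""
    return [1 if (x - mean) / std > 0 else 0 for x in batch]

# Training domain: features centered at 0.
train_batch = [-1.0, -0.5, 0.5, 1.0]
train_mean, train_std = batch_statistics(train_batch)

# Shifted test domain: same pattern, offset by +10; labels are only used for evaluation.
test_batch = [9.0, 9.5, 10.5, 11.0]
test_labels = [0, 0, 1, 1]

# Without adaptation, the training statistics are reused and every point looks positive.
frozen_predictions = predict(test_batch, train_mean, train_std)

# With adaptation, the statistics are re-estimated from the unlabeled test batch.
test_mean, test_std = batch_statistics(test_batch)
adapted_predictions = predict(test_batch, test_mean, test_std)
```

The adapted predictions recover the labels on this toy shift because the shift is a pure offset; shifts that change the relationship between features and labels would need the richer updates listed above.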

adv attacks