Generalization of the hand-written fMRI encoders

We took a handful of the best non-redundant hand-written models from each evolutionary run and tested how well they transfer (1) to new subjects (UTS01, UTS02) on the same stories, and (2) to new stories (held-out stories from the training pool) on the same subject (UTS03). Every number on this page, including each model's "original" score, is re-measured under one identical pipeline (10-gram features, 30-TR edge trim, ndelays=4, 8 training stories, bootstrapped ridge), so original-vs-new comparisons are apples-to-apples. See the original report (report.html) for the full evolution curves.

Only genuinely hand-wired models are tested here. Disallowed iterations flagged in the headline report are excluded from selection: Jun-03 run 3's external pretrained encoders (Qwen / DistilBERT / RoBERTa / GloVe) and trained model, and Jun-04 run 1's late corpus-statistics push (an n-gram surprisal language model plus LSA / PPMI co-occurrence word vectors built from the stimulus text), which reached ~0.088 but breaks the "no corpus statistics" rule.

All six runs are represented. Jun-04 run 1 (Claude Opus 4.8, xhigh) was given all prior runs' results and resumed Jun-03 run 4's FeatBag. Its best legitimate model, FeatBagNovelty_NamesDense_xrun (0.084, itself just past GPT-2 XL), adds a content-free within-story novelty block (first-mention / repetition-suppression, computed per story at inference) plus a name gazetteer. We test it here as a genuinely new mechanism alongside the FeatBag family.

Two differences from the headline report. (1) The headline report.html scored each run with its own recorded number; here we re-run every model through the same fixed pipeline, so the "original" bars can differ slightly from the headline values (and substantially for the May-27 run). (2) The May-27 models were originally scored untrimmed (and with ndelays=3); re-measured here with 30-TR trimming + ndelays=4, their "original" score drops well below the 0.11 reported there. Their headline (untrimmed) numbers are shown in the recap table for reference only.
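To make the two pipeline settings that differ between the May-27 scoring and this page concrete, here is a minimal sketch of delay stacking (ndelays) and edge trimming. It is not the harness code: the function names are hypothetical, and it assumes that delays run from 1 to ndelays TRs with zero padding and that the trim removes 30 TRs from each end of a story.

```python
import numpy as np

def make_delayed(features, ndelays=4):
    """Stack copies of a (TRs x features) matrix shifted by 1..ndelays TRs.

    Assumption: delays start at 1 TR and the shifted-in rows are zero-padded.
    """
    n_trs, n_feats = features.shape
    delayed = np.zeros((n_trs, n_feats * ndelays))
    for i, d in enumerate(range(1, ndelays + 1)):
        # Column block i holds the features as they were d TRs earlier.
        delayed[d:, i * n_feats:(i + 1) * n_feats] = features[:n_trs - d]
    return delayed

def trim_edges(x, trim=30):
    """Drop `trim` TRs from both ends of a story (assumed symmetric trim)."""
    return x[trim:x.shape[0] - trim]
```

Under these assumptions, moving from ndelays=3 to ndelays=4 widens the design matrix by one more copy of the features, and trimming removes the story edges from both fitting and scoring, which is why untrimmed and trimmed scores are not directly comparable.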

Key findings

1. Cross-subject transfer is partial but consistent. Every one of the 17 models keeps roughly half of its correlation when moved to a new subject (subject retention ~42%–62%), with UTS02 transferring better than UTS01 for every model. So the hand-built circuits capture genuine, subject-general language signal — but a substantial part of each model's score is subject-specific (the ridge readout is always refit per subject; the feature circuit is what transfers).

2. Cross-story transfer is weaker than cross-subject, but holds up. Averaged over all 85 held-out shared stories, models keep ~33%–47% (mean ~38%) of their original correlation — below the cross-subject ~42%–62%, but well above what a small sample implied. (An earlier 3-story version of this test gave only ~19–32%: those particular stories were a noisy, pessimistic draw. Averaging over the full shared set raises and stabilizes the estimate, and the new-story score now tracks the original ranking — the strongest original models stay strongest — rather than looking flat.) New stories remain the harder axis of generalization, just not as severe as the 3-story sample suggested.

3. Model rankings are preserved. The strongest original models (LexFeat / FeatBag / WordNet families) stay at or near the top in both new settings, so the relative conclusions from the evolutionary runs hold up under transfer — the differences are attenuated, not reordered.

4. The May-27 models are not special once measured consistently. Re-measured under the same trimmed pipeline, WordNetMorphLingPerceptual drops from its reported (untrimmed) 0.115 to 0.064, and its transfer is in line with the other runs.
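One way to quantify the claim in finding 3 is a rank correlation between each model's original score and its score in a new setting. The sketch below computes a Spearman correlation with NumPy only. It is an illustration rather than the analysis used for this page, the function names are hypothetical, and it assumes the scores are distinct (ties are not averaged).

```python
import numpy as np

def ranks(values):
    """Rank positions (0 = smallest). Assumes distinct values; ties are not averaged."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def rank_agreement(orig_scores, new_scores):
    """Spearman rank correlation between original and transferred scores."""
    return float(np.corrcoef(ranks(orig_scores), ranks(new_scores))[0, 1])
```

A value near 1 means the ordering of models is unchanged by the transfer, even if every score shrinks. A value near 0 would mean the original ranking carries no information about the new setting.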

Original runs (recap)

The six evolutionary runs and the models sampled from each for this experiment.

Run	Driver model	Effort	Untrimmed	Models tested here
Jun 03 · run 1	Claude Opus 4.8	medium	no	LexFeatFreqMerge, LexFeatBoC
Jun 03 · run 2	GPT-5.5	xhigh	no	semantic_bestlex_compact_v94, lexsem_rec70_tail24_v09, content_structure_v13
Jun 03 · run 3	Gemini 3.1 Pro	high	no	WordBoundaryFeatures, Deep_EnsembleWB_Tuned, MultiScale_Temporal_Pool
Jun 03 · run 4	Claude Opus 4.7	xhigh	no	FeatBag_v1116_Emo40, FeatBag_v11_MoreSEM, FeatBag_v2_WordID
May 27 · run 1	Claude Opus 4.7	xhigh	yes	WordNetMorphLingPerceptual, WordNetMorphLingMultiTau
Jun 04 · run 1	Claude Opus 4.8	xhigh	no	FeatBagNovelty_NamesDense_xrun, FeatBag3Head_EmoInt_ConcCat, N3_OtherRefBonus8, E24_ContentOnly

New subjects — same stories, train+test on UTS01 / UTS02

Each model's feature circuit is fixed; only the ridge readout is refit per subject (the encoding model is always subject-specific). Bars compare the original subject (UTS03) with the two new subjects on the identical test stories.
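The split between a fixed feature circuit and a per-subject readout can be sketched as follows. This is a simplified illustration, not the harness: it uses a single closed-form ridge with one fixed alpha instead of the bootstrapped ridge, it assumes the reported score is the mean voxelwise Pearson correlation, and all function names are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, alpha):
    """Closed-form ridge weights for features X (TRs x feats) and responses Y (TRs x voxels)."""
    n_feats = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feats), X.T @ Y)

def voxelwise_corr(Y_true, Y_pred):
    """Pearson correlation per voxel (column) between measured and predicted responses."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / np.maximum(denom, 1e-12)

def score_per_subject(X_train, X_test, responses, alpha=1.0):
    """Refit the readout for each subject; the features are shared and never change.

    `responses` maps a subject ID to a (Y_train, Y_test) pair.
    """
    scores = {}
    for subject, (Y_train, Y_test) in responses.items():
        W = fit_ridge(X_train, Y_train, alpha)  # subject-specific readout
        scores[subject] = float(voxelwise_corr(Y_test, X_test @ W).mean())
    return scores
```

The point of the design is that X_train and X_test come from the same stimulus-driven circuit for every subject, so any score that survives on UTS01 or UTS02 reflects the features rather than a readout tuned to UTS03.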

Grouped by model (colored by source run). Higher = better transfer.

Summary — original vs. new-subject correlation

Each marker is one model on one new subject; the dashed line is y = x (perfect transfer). Points below the line lost correlation on the new subject.

New stories — same subject (UTS03), held-out stories

Same subject and the same fixed circuits, but evaluated on all 85 shared stories that were used neither for training (the 8 fit stories are excluded) nor for testing (the 3 original test stories are excluded). This is the entire remainder of the cross-subject shared story pool, not just a 3-story sample. The ridge readout is still fit on the original 8 training stories; only the held-out evaluation set changes. Averaging over 85 stories makes the new-story estimate far more stable than the earlier 3-story version.
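A minimal sketch of scoring a fixed readout on a set of held-out stories is shown below. It assumes the new-story score is the average of per-story mean voxelwise correlations (the harness may instead concatenate stories), and the function names are hypothetical. The standard error is included only to illustrate why a larger story set gives a steadier estimate.

```python
import numpy as np

def column_corr(a, b):
    """Pearson correlation between matching columns of a and b."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    denom = np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0)
    return (a * b).sum(axis=0) / np.maximum(denom, 1e-12)

def held_out_score(W, stories):
    """Mean and standard error of per-story scores for fixed readout weights W.

    `stories` maps a story name to (X, Y): delayed features and measured responses.
    """
    per_story = np.array([column_corr(Y, X @ W).mean() for X, Y in stories.values()])
    mean = per_story.mean()
    # The standard error shrinks roughly with the square root of the story count,
    # assuming per-story scores are independent draws.
    sem = per_story.std(ddof=1) / np.sqrt(len(per_story)) if len(per_story) > 1 else float("nan")
    return float(mean), float(sem)
```

If per-story scores were independent with similar spread, going from 3 to 85 stories would shrink the standard error by a factor of about sqrt(85/3), roughly 5.3.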

Original test stories vs. the new held-out stories, per model.

Summary — original vs. new-story correlation

Each marker is one model; dashed line is y = x.

Transfer retention

New-setting correlation as a fraction of the (re-measured) original. ~1.0 means the model transferred with little loss.

Run	Model	orig	UTS01	UTS02	new stories	subj retention	story retention
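The two retention columns follow from the four score columns. The sketch below shows one plausible computation. It assumes subject retention is the mean of the UTS01 and UTS02 scores divided by the original score, which the page does not state explicitly, and the function name is hypothetical.

```python
def retention(orig, uts01, uts02, new_stories):
    """Return (subject retention, story retention) as fractions of the original score.

    Assumption: subject retention averages the two new subjects before dividing.
    """
    if orig <= 0:
        raise ValueError("original score must be positive to define retention")
    subj = ((uts01 + uts02) / 2.0) / orig
    story = new_stories / orig
    return subj, story
```

Both values are ratios, so 0.5 means the model kept half of its re-measured original correlation in the new setting.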

Generated from generalization/results.csv (harness: generalization/harness.py, models from runs-neuro/*/interpretable_transformers_lib/). Pipeline: 30-TR edge trim, 10-gram, ndelays=4, 8 train stories, bootstrapped ridge over ~95.6k voxels.