Brain Morphometry
Machine Learning Experiments

Multi-modal Prediction

Age 55-89 | FreeSurfer + Cognitive Assessment

Multi-modal Data Sources

Neuroimaging

MRI: 56 brain regions (FreeSurfer), Total Intracranial Volume

Cognitive Assessment

RAVLT Memory: Learning trials (A1-A5), interference (B1), delayed recall (A7)

Language: Naming & comprehension tasks (accuracy + reaction time)

Speech & Demographics

Speech Fluency: 154 variables (pause, phonation, rate, voicing)

Demographics: Age, sex, education, MoCA, MMSE, depression

Data Structure: Long Format Example

Dataset: 12,992 rows × 76 columns

Structure: 232 participants × 56 brain regions = 12,992 observations

Format: Each participant has 56 rows (one per brain region)

ID Region GM_Vol age sex MoCA naming_acc_% RAVLT_A1 RAVLT_A5 RAVLT_A7
AA1010 lSupFroG 29.9 74 F 24 90% 7 14 12
AA1010 rSupFroG 28.1 74 F 24 90% 7 14 12
AA1010 lMidFroG 20.9 74 F 24 90% 7 14 12
... 53 more brain regions for AA1010 ...
AB4388 lSupFroG 31.2 76 F 26 100% 5 11 9
AB4388 rSupFroG 29.8 76 F 26 100% 5 11 9

Note: Cognitive scores (MoCA, Naming, RAVLT) are repeated across all brain regions for each participant. Brain volume varies by region.
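Because participant-level scores repeat on every region row, most models need one row per participant. The following is a minimal sketch of that long-to-wide reshape, assuming pandas and using a few rows from the example above; the `GM_` prefix and the variable names are illustrative, not the project's actual code.

```python
import pandas as pd

# Toy long-format rows mirroring the example table (subset of columns).
long_df = pd.DataFrame({
    "ID":     ["AA1010", "AA1010", "AA1010", "AB4388", "AB4388"],
    "Region": ["lSupFroG", "rSupFroG", "lMidFroG", "lSupFroG", "rSupFroG"],
    "GM_Vol": [29.9, 28.1, 20.9, 31.2, 29.8],
    "age":    [74, 74, 74, 76, 76],
    "MoCA":   [24, 24, 24, 26, 26],
})

# Participant-level columns are constant within ID, so keep one row per ID.
subject_df = long_df.drop_duplicates("ID").set_index("ID")[["age", "MoCA"]]

# Spread regional volumes into one column per brain region.
gm_wide = long_df.pivot(index="ID", columns="Region", values="GM_Vol").add_prefix("GM_")

# One row per participant: demographics/cognition plus regional volumes.
wide_df = subject_df.join(gm_wide)
print(wide_df)
```

Regions missing for a participant in the toy data (here `lMidFroG` for AB4388) appear as NaN in the wide table.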

Data Characteristics

232 Participants
70±7 Age (years)
84% F Gender

Variable Ranges (Scenario 1)

Variable | Range | Mean ± SD | Missing
Age | 55 - 89 years | 70.2 ± 7.0 | 0%
Education | 0 - 21 years | 14.4 ± 2.7 | 0.9%
MoCA (cognitive screening) | 13 - 29 / 30 | 23.9 ± 3.2 | 0.4%
Naming accuracy | 3% - 100% | 89.0 ± 14.2% | 9.1%
RAVLT A1 (first trial) | 1 - 10 words | 5.1 ± 1.7 | 0%
RAVLT A5 (fifth trial) | 0 - 15 words | 10.8 ± 2.9 | 0%
RAVLT A7 (delayed recall) | 0 - 15 words | 8.8 ± 3.6 | 0%
Gray matter volume | 1 - 107 mm³ | Varies by region | 0%
Data Quality: Missingness is below 1% for every variable except naming accuracy. Memory tests (RAVLT) have complete data. Language tasks have ~9% missingness but still provide N=211 for analysis.

Three Analysis Scenarios

Scenario | Data Merged | N | Purpose
Scenario 1 | MRI + RAVLT + Language | 232 | Brain-memory-language
Scenario 2 | MRI + Language + Speech | 197 | Brain-speech production
Scenario 3 | All modalities | 183 | Rigorous multi-modal validation
Design principle: Inner joins keep only participants who have data in every merged modality, so sample size decreases as more modalities are merged.
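The effect of inner joins on sample size can be sketched as follows, assuming pandas and one table per modality keyed by participant `ID`. The table contents, the `inner_merge` helper, and the participant IDs are hypothetical.

```python
import pandas as pd

# Hypothetical per-modality tables, one row per participant.
mri = pd.DataFrame({"ID": ["P1", "P2", "P3", "P4"],
                    "GM_lSupFroG": [29.9, 31.2, 27.5, 30.1]})
ravlt = pd.DataFrame({"ID": ["P1", "P2", "P3"],
                      "RAVLT_A5": [14, 11, 9]})
speech = pd.DataFrame({"ID": ["P1", "P3", "P5"],
                       "pause_composite": [0.42, 0.57, 0.31]})

def inner_merge(*tables, key="ID"):
    """Inner-join tables on `key`, keeping only participants present in all."""
    merged = tables[0]
    for table in tables[1:]:
        # validate guards against accidental duplicate IDs inflating the row count.
        merged = merged.merge(table, on=key, how="inner", validate="one_to_one")
    return merged

two_modalities = inner_merge(mri, ravlt)            # P1, P2, P3
three_modalities = inner_merge(mri, ravlt, speech)  # P1, P3

print(len(mri), len(two_modalities), len(three_modalities))
```

Each added modality can only keep or shrink the set of participants, which matches the decreasing N across the three scenarios.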

Scenario 1: Brain-Memory-Language

232 Participants
27 Tasks
~150 Experiments

Data Structure: Long format with 12,992 rows (232 participants × 56 brain regions)

Coverage: 112 MRI features (56 GM + 56 WM regional volumes) + RAVLT memory scores + Language tasks (naming, comprehension) + Demographics

Design: Phase 1-3 (9 initial tasks, 42 models) + Phase 4 (18 new tasks)

Scenario 1: Phase 1-3 Initial Tasks (9 tasks)

Description | N | Features | Target | Type | Best Model | Performance
MRI → MoCA | 231 | 56 GM + 56 WM regions | MoCA total score | Regression | ExtraTrees | R²=0.126
MRI → MoCA<26 | 232 | 56 GM + 56 WM regions | MoCA < 26 | Classification | ExtraTrees | AUC=0.658
MRI → Naming accuracy | 211 | 56 GM + 56 WM regions | naming_accuracy_% | Regression | ElasticNet | R²=0.016
MRI → Naming binary | 232 | 56 GM + 56 WM regions | naming_acc < 80% | Classification | GaussianNB | AUC=0.615
Cognition → Naming | 210 | MoCA + MMSE + age | naming_accuracy_% | Regression | ElasticNet | R²=0.098
MRI+Cog → RAVLT A5 | 231 | 56 GM + 56 WM + MoCA + age | RAVLT_A5 (trial 5) | Regression | ExtraTrees | R²=0.195
MRI+Demo → Age group | 231 | 56 GM + 56 WM + sex + edu | age_group (<70 vs ≥70) | Classification | SVC-RBF | AUC=0.768 ✓
MRI clustering | 232 | 56 GM + 56 WM regions | - | Clustering | KMeans-2 | Sil=0.223
Time+MRI → RAVLT A5 | 232 | 56 GM + 56 WM + time_diff | RAVLT_A5 | Regression | ExtraTrees | R²=-0.003

Phase 1-3 success rate: 1/9 (11%) | ~150 experiments (9 tasks × 18 models)

Scenario 1: Phase 4 New Tasks (10 analyzed)

Category | Description | N | Features | Target | Type | Performance
Behavioral | A7 delayed recall prediction | 204 | A1-A5,B6+naming/comp acc+MoCA+MMSE+age+edu+sex | RAVLT_A7 | Regression | R²=0.782 ✓
Behavioral | RAVLT learning curve clustering | 232 | A1,A2,A3,A4,A5,A7,A8,B6 | cluster_label | Clustering | Sil=0.405 ✓
Behavioral | Fast/slow learners | 232 | A1,A2,A3,A4,A5,A7,A8,B6 | learning_speed (binary) | Classification | AUC=0.69
Behavioral | RT consistency analysis | 211 | naming_rt, comprehension_rt | Pearson correlation | Correlation | r=0.382*** ✓
Behavioral | RT-accuracy tradeoff × age | 210 | naming_rt, naming_acc, age | RT×age interaction | ANCOVA | t=2.87** ✓
Behavioral | Memory → Language | 210 | A1,A2,A3,A4,A5,A7,A8,B6 | naming_accuracy_% | Regression | R²=0.08
Behavioral | Education × Age interaction | 231 | edu_years, age, edu×age | MoCA total | Regression | R²=0.07
MRI | refined ROI-specific prediction | 231 | Temporal+IFG GM regions | naming_accuracy_% | Regression | R²<0
MRI | Gray matter vs white matter | 231 | 56 GM vs 56 WM (separate models) | MoCA total | Regression | R²<0
Multi-modal | Age → Hippocampus → Memory mediation | 231 | age → hippocampus_GM → RAVLT_A5 | indirect_effect | Mediation | 27.2% ⚠

Phase 4 success rate: 4/10 analyzed (40%) | Key pattern: Behavioral 4/7 (57%) vs MRI 0/2 (0%)

Overall Scenario 1: 5/27 tasks succeeded (18.5%): 1 of 9 in Phase 1-3 and 4 of the 10 analyzed Phase 4 tasks

Scenario 2: Brain-Language-Speech

197 Participants
40 Tasks
428 Experiments

Data Structure: Long format with 11,032 rows (197 participants × 56 brain regions)

Coverage: 112 MRI features (56 GM + 56 WM regional volumes) + 154 speech features + Language tasks + HADS mental health

Design: 7 categories (40 tasks × 11 models) | 3-round methodological audit

Scenario 2 Tasks: Brain ↔ Behavior (14 tasks)

Category A: Brain → Behavior (8 tasks; 6 listed)

Description | Features | Target | Type | Best Model | Performance
MRI → Speech rate | 56 GM + 56 WM regions | speech_rate (words/sec) | Regression | - | R² < 0
MRI → Speech clarity | 56 GM + 56 WM regions | speech_clarity_score | Regression | - | R² < 0
MRI → Naming accuracy | Temporal+IFG GM regions | naming_accuracy_% | Regression | - | R² < 0
MRI → Age group | 56 GM + 56 WM regions | age_group (<70 vs ≥70) | Classification | LogisticReg | AUC=0.759 ✓
MRI → Cognitive risk | 56 GM + 56 WM regions | cognitive_risk (MoCA<26 or naming<80%) | Classification | - | AUC < 0.65
MRI → Depression risk | 56 GM + 56 WM regions | depression_risk (HADS_D ≥ 8) | Classification | Lasso | AUC=0.655

Category B: Behavior → Brain (6 tasks)

Description | Features | Target | Type | Best Model | Performance
Speech → Brain health index | 154 speech fluency vars (pause/phonation/rate/voicing) | brain_health_index (MRI composite) | Regression | - | R² < 0
Behavior → Cognitive risk | 154 speech + naming/comp + MoCA + demographics | cognitive_risk (binary) | Classification | AdaBoost | AUC=0.685 ✓
Speech → Frontal GM | 154 speech fluency vars | frontal_lobe_GM_volume | Regression | - | R² < 0
Speech → Temporal GM | 154 speech fluency vars | temporal_lobe_GM_volume | Regression | - | R² < 0
Behavior → Age | 154 speech + naming/comp + MoCA | age (continuous) | Regression | - | R² < 0
Speech rate → Gender | speech_rate (single feature) | sex (M/F) | Classification | ElasticNet | AUC=0.669

Category A+B: 3/14 tasks succeeded (21%) | Pattern: MRI predicts age well, behavior→brain mostly fails

Scenario 2 Tasks: Mental Health (Category E, 6 tasks)

Description | Features | Target | Type | Best Model | Performance
Speech → HADS Depression score | 154 speech fluency vars | HADS_depression_score (0-21) | Regression | - | R² < 0
Speech → HADS Anxiety score | 154 speech fluency vars | HADS_anxiety_score (0-21) | Regression | - | R² < 0
Pause composite → Depression risk | pause_composite (single feature) | depression_risk (HADS_D ≥ 8) | Classification | Multiple | AUC=0.621 ✓ MOST RELIABLE
MRI+Speech → Depression risk | 56 GM + 56 WM + selected speech vars (60+ total) | depression_risk (HADS_D ≥ 8) | Classification | - | AUC=0.640
Mental health → Speech | HADS_depression + HADS_anxiety | 154 speech features (multi-target) | Regression | - | R² < 0
Demographics+Mental → Cognitive risk | age + sex + edu + HADS_D + HADS_A | cognitive_risk (binary) | Classification | - | AUC < 0.65
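⭐ Key finding: The single-feature pause composite (AUC=0.621) was the only task to survive the 3-round methodological audit. The 60+ feature MRI+Speech model scored nominally higher (AUC=0.640) but did not survive, so the simpler model is the more reliable one.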

Category E: 1/6 tasks succeeded (17%) | Depression risk weakly detectable from speech pauses (AUC=0.621)

Scenario 2 Tasks: Other Categories (20 tasks)

Category | Features → Target | Count | Type | Success | Result
C: Speech → Language/Cognition | 154 speech fluency → naming_acc, comp_acc, naming_rt, comp_rt | 8 | Reg/Class | 0/8 | All failed (R² < 0)
D: Language → Speech | naming_acc, comp_acc, naming_rt, comp_rt → speech_rate/clarity/fluency | 4 | Regression | 0/4 | All failed (R² < 0)
F: Multi-modal Fusion | (56 GM + 56 WM) + 154 speech + naming/comp → cognitive_risk | 5 | Mixed | 1/5 | Cognition (AUC=0.695) ✓
G: Simple Baseline | age + sex → age_group, brain_health, speech_disfluency | 3 | Reg/Class | 0/3 | All failed (corrected)
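⚠️ Critical finding: Speech fluency and language task performance did not predict each other in either direction (R² < 0 for all 12 tasks in Categories C and D). This is consistent with, but does not by itself establish, different underlying neural mechanisms.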

Overall Scenario 2: 2/40 tasks succeeded (5.0%) | 428 experiments (40 tasks × 11 models)

Methodological note: Results affected by 5 types of data leakage; extensive corrections applied in 3-round audit

Scenario 3: Rigorous Validation

194 Participants
10 Tasks
~3500 Models Fitted

Data Structure: Long format with 10,864 rows (194 participants × 56 brain regions), reshaped to wide format with 194 participants × 230+ variables

Complete multi-modal: All sources merged (MRI + RAVLT + Language + Speech + Demographics)

Goal: Gold-standard methodology with 5×5 nested CV, permutation testing, multiple comparison correction
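A permutation test asks how often a model scores as well on shuffled targets as on the real ones. The following is a minimal sketch with 200 shuffles, assuming scikit-learn and synthetic data; the Ridge model, the data, and all variable names are illustrative and are not the study's pipeline.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, permutation_test_score

# Synthetic stand-in data: 80 "participants", 5 features, one informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=80)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Refit the model on 200 shuffled copies of y to build a null distribution of R².
score, perm_scores, p_value = permutation_test_score(
    Ridge(alpha=1.0), X, y,
    cv=cv, scoring="r2", n_permutations=200, random_state=0,
)

print(f"R² = {score:.3f}, permutation p = {p_value:.4f}")
```

With 200 permutations the smallest attainable p-value is 1/201, which bounds how strict a multiple-comparison threshold the test can support.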

Three-Tier Experimental Design

Tier Purpose Method
Tier 1: Confirmatory | Verify Scenario 1 findings with strict controls | Bonferroni correction; permutation testing (200 iterations)
Tier 2: Exploratory | Test new hypotheses on learning efficiency | FDR correction; 95% confidence intervals
Tier 3: Fusion | Test multi-modal integration benefits | Nested cross-validation; multiple random seeds
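The two corrections named in the tier design differ in strictness. The sketch below contrasts Bonferroni with Benjamini-Hochberg, assuming that "FDR correction" refers to the Benjamini-Hochberg procedure (the source does not say which FDR method was used). The p-values are made up for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * alpha (controls false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject

# Hypothetical p-values for five tasks (not results from this study).
pvals = [0.045, 0.001, 0.20, 0.012, 0.025]
print("Bonferroni:", bonferroni(pvals))
print("BH (FDR):  ", benjamini_hochberg(pvals))
```

On these illustrative values Bonferroni keeps one task and Benjamini-Hochberg keeps three, which is why the stricter method suits confirmatory tests and the more permissive one suits exploratory tests.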

Enhanced Rigor vs. Scenarios 1-2

✓ 5×5 nested cross-validation (vs basic 5-fold)

✓ Pre-planned corrections (vs post-hoc)

✓ Pre-emptive leakage prevention (vs post-detection)
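Nested cross-validation and leakage prevention can be combined in one pattern: preprocessing lives inside a pipeline, hyperparameters are tuned in an inner loop, and performance is estimated in an outer loop. This is a minimal sketch assuming scikit-learn, synthetic data, and a reading of "5×5" as 5 outer folds by 5 inner folds. The model and grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 120 "participants", 10 features, two informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=120)

# Scaling inside the pipeline is refit on each training fold only,
# so no test-fold statistics leak into preprocessing.
pipe = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # tunes alpha
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates performance

search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring="r2")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")

print(f"Nested CV R²: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

Repeating the outer loop with several `random_state` values would give the multiple-seed variant listed for Tier 3.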

Scenario 3: Results (10 tasks)

Tier | Features | Target | N | Result
Tier 1: Confirmatory | A1,A2,A3,A4,A5,A8,B6 + naming_acc + comp_acc + MoCA + MMSE + age + edu | A7 (delayed recall) | 183 | R² = 0.8148 ✓
Tier 1: Confirmatory | MoCA + MMSE + age + sex + edu | depression_risk (HADS_D ≥ 8) | 183 | AUC = 0.62 ✓
Tier 1: Confirmatory | 20 selected MRI regions (GM volumes) | A7 (delayed recall) | 183 | R² = -0.03
Tier 1: Confirmatory | 20 MRI + A1,A2,A3,A4,A5,A8,B6 + naming/comp + demographics (33 total) | A7 (delayed recall) | 183 | R² = 0.81 (no gain)
Tier 2: Exploratory | learning_efficiency = (A5-A1)/4 | learning_efficiency_score | 194 | R² = 0.41 ✓
Tier 2: Exploratory | naming_rt | comprehension_rt (correlation) | 189 | r = 0.38 ✓
Tier 2: Exploratory | L_temporal_GM - R_temporal_GM (asymmetry) | language_lateralization | 189 | R² < 0
Tier 2: Exploratory | age_group (55-65 vs 66-75 vs 76-89) | A7, MoCA, naming (group comparisons) | 194 | No age effects
Tier 3: Fusion | 20 MRI + RAVLT + Language + 154 Speech (200+ total) | A7 (delayed recall) | 183 | R² = 0.78 (worse)
Tier 3: Fusion | 20 MRI + pause_composite + speech_rate (22 total) | depression_risk | 183 | AUC = 0.59 (worse)

Success rate: 4/10 (40.0%) | Pattern: Behavioral-only succeeds, MRI fails, multi-modal hurts
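The Tier 2 learning-efficiency feature, defined in the table as (A5-A1)/4, is the average number of words gained per trial between RAVLT trials 1 and 5. A short worked example using the two participants from the long-format example table (the function and dictionary names are illustrative):

```python
def learning_efficiency(a1, a5):
    """Average words gained per trial across the four steps from A1 to A5."""
    return (a5 - a1) / 4

# RAVLT_A1 and RAVLT_A5 values taken from the long-format example rows.
participants = {"AA1010": (7, 14), "AB4388": (5, 11)}

efficiency = {pid: learning_efficiency(a1, a5)
              for pid, (a1, a5) in participants.items()}
print(efficiency)
```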

Key Scientific Findings

1. Behavioral > Brain Structure

Cognitive/behavioral features (RAVLT, language) consistently outperform brain morphometry for predicting cognitive outcomes in healthy aging.

2. Classification vs. Regression

MRI succeeds for only one discrete outcome, age group (AUC=0.76-0.77). Other MRI classifiers reach AUC ≤ 0.66, and MRI-only regression is weak (R² ≤ 0.13 in Scenario 1, negative in Scenarios 2-3).

3. Multi-modal Paradox

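Adding MRI to behavioral features provides no benefit (A7 prediction: R² = 0.81 with or without 20 MRI regions), and merging all modalities degrades performance (R² = 0.78 with 200+ features), likely because of overfitting at high p/n ratios.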

Methodological Implications

Data Leakage is Pervasive

5 types identified in Scenario 2. Rigorous pre-emptive design essential.

Sample Size Constraints

N=183-232 limits feature capacity. More features ≠ better prediction.
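The feature-capacity constraint can be made concrete by computing the features-per-participant ratio (p/n) for a few feature sets. The counts below come from the task tables in this deck; "200+" is entered as its lower bound of 200, and the dictionary labels are illustrative.

```python
# (feature count p, sample size N) pairs taken from the task tables.
feature_sets = {
    "MRI only (56 GM + 56 WM)": (112, 231),
    "Tier 1 MRI + behavioral": (33, 183),
    "Tier 3 depression fusion": (22, 183),
    "Tier 3 full fusion (200+ features, lower bound)": (200, 183),
}

ratios = {name: p / n for name, (p, n) in feature_sets.items()}
for name, ratio in ratios.items():
    print(f"{name}: p/n = {ratio:.2f}")
```

The full fusion set has more features than participants (p/n above 1), the regime in which unregularized models can fit the training data exactly.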

Validation Rigor Matters

Nested CV + permutation testing + correction methods reduce false positives.

Best result: R²=0.81 for RAVLT A7 prediction using behavioral features only