Can Passive Sensing Replace Questionnaires for Mental Health Prediction?
Three-study head-to-head comparison (N=1,559) of personality questionnaires vs. continuous passive smartphone/wearable sensing across 15 mental health and academic outcomes
Status: Manuscript under review at IEEE Journal of Biomedical and Health Informatics (JBHI).
Overview
A three-study head-to-head comparison across three universities (N=1,559) testing whether weeks of continuous passive smartphone and wearable sensing can replace brief personality questionnaires for predicting mental health and academic performance. The central finding: two BFI items (10 seconds) outperform 28 sensing features collected over weeks, and personality questionnaires win 14 of 15 outcome comparisons (93%). The result holds across traditional ML, deep learning (1D-CNN), and time-series foundation models (MOMENT).
Research Question
Despite a decade of investment in passive sensing for mental health, the field’s central premise — that behavioral signals from phones and wearables can replace or augment self-report — has never been rigorously stress-tested at scale against the strongest available baseline. This project asks the head-to-head question: under what conditions, if any, does passive sensing add value beyond a brief personality questionnaire?
Studies
| | Study 1: StudentLife | Study 2: NetHealth | Study 3: GLOBEM |
|---|---|---|---|
| University | Dartmouth (2013) | Notre Dame (2015–2019) | U. Washington (2018–2021) |
| N | 28 | 722 | 809 |
| Personality | BFI-44 | BFI-44 | BFI-10 |
| Sensing | 13 modalities, 87 features | Fitbit + comm logs, 28 features | Fitbit + phone + GPS, 19 features + 2,597 RAPIDS |
| MH outcomes | PHQ-9, PSS, Loneliness, Flourishing, PANAS | CES-D, STAI, BAI | BDI-II, STAI, PSS-10, CES-D, UCLA |
| Academic | GPA | GPA | — |
Together: 3 universities, 3 time periods (2013–2021), 15 outcomes, 4 ML algorithms plus deep learning, and 44 robustness analyses.
Key Results
Questionnaires dominate at the population level
- Personality wins 14/15 outcome comparisons (93%), mean R² = 0.126 vs. sensing mean R² = −0.153
- Two BFI items (10 seconds, R² = 0.36 for CES-D) outperform 28 sensing features collected over weeks (R² = −0.16)
- Neuroticism ranks #1 SHAP feature in 28/28 mental health models across all three studies
- Conscientiousness ranks #1 for GPA in 8/8 models
- Deep learning cannot rescue sensing: 1D-CNN R² = −0.03 to −0.10; MOMENT foundation-model embeddings R² = −1.0 to −1.7
- Sensing features are highly reliable (ICC(3,k) = 0.73–0.98; see the sketch below): the problem is construct relevance, not measurement quality
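For concreteness, an ICC(3,k) figure like the one above can be computed with pingouin's `intraclass_corr` on long-format data. This is a minimal sketch on a synthetic frame, not the project's pipeline; the column names are illustrative assumptions:

```python
import pandas as pd
import pingouin as pg

# Synthetic long-format frame: one row per (participant, week) with a weekly
# aggregate of a single sensing feature (e.g. mean daily step count).
df = pd.DataFrame({
    "participant": ["p1"] * 3 + ["p2"] * 3 + ["p3"] * 3,
    "week":        [1, 2, 3] * 3,
    "steps":       [8200, 7900, 8400, 4100, 4500, 4300, 11000, 10500, 11200],
})

# ICC(3,k): two-way mixed effects, consistency, average of k measurements.
# Participants are the "targets"; repeated weeks play the role of "raters".
icc = pg.intraclass_corr(data=df, targets="participant",
                         raters="week", ratings="steps")
print(icc.loc[icc["Type"] == "ICC3k", ["Type", "ICC"]])
```

A high ICC(3,k) means the feature separates people stably week after week, which is why reliability is not the bottleneck here.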
Sensing has value under specific conditions
| Condition | Evidence | Effect size |
|---|---|---|
| Lagged early warning | Autoregressive + sensing beats autoregressive alone (sketched below) | +0.031 R² |
| Communication metadata | SMS/call logs improve depression prediction (S2) | +0.030 R² |
| Sleep + nonlinear models | RF captures sleep–anxiety link (S2) | +0.055 R² |
| Idiographic monitoring | 17% of individuals show person-specific R² > 0.3 | Variable |
| Engagement signal | Device non-wear correlates with anxiety | r = −0.12 |
| Clinical classification | Pers + Beh AUC over Pers-only (S2) | +0.06–0.08 AUC |
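As a minimal sketch of the first condition (lagged early warning), the comparison below fits an autoregressive baseline and the same model augmented with lagged sensing features; the synthetic panel and column names are illustrative assumptions, not the study's schema:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic weekly panel: next week's symptom score, this week's score,
# and two lagged sensing aggregates per person-week.
rng = np.random.default_rng(0)
n_people, n_weeks = 100, 8
panel = pd.DataFrame({
    "participant": np.repeat(np.arange(n_people), n_weeks),
    "phq_lag1":    rng.normal(size=n_people * n_weeks),
    "sleep_dur":   rng.normal(size=n_people * n_weeks),
    "steps":       rng.normal(size=n_people * n_weeks),
})
panel["phq_next"] = (0.6 * panel["phq_lag1"] + 0.1 * panel["sleep_dur"]
                     + rng.normal(scale=0.8, size=len(panel)))

groups, cv = panel["participant"], GroupKFold(n_splits=5)  # person-grouped folds

# Autoregressive baseline: last week's score only.
ar = cross_val_score(Ridge(), panel[["phq_lag1"]], panel["phq_next"],
                     groups=groups, cv=cv, scoring="r2")
# Autoregressive + lagged sensing features.
aug = cross_val_score(Ridge(), panel[["phq_lag1", "sleep_dur", "steps"]],
                      panel["phq_next"], groups=groups, cv=cv, scoring="r2")

print(f"AR-only R2 = {ar.mean():.3f}, AR+sensing R2 = {aug.mean():.3f}")
print(f"Delta R2 = {aug.mean() - ar.mean():+.3f}")  # the study reports +0.031
```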
Reframing passive sensing
Within-person centered R² ≈ 0 across 3,149 person-weeks of weekly PHQ-4 and EMA. But per-person mean |r| = 0.33–0.35 between mood and concurrent sensing, with 17% of individuals showing idiographic R² > 0.3. The implication: passive sensing is not a nomothetic screening tool (one model fits all) but a potential idiographic monitoring tool requiring individual calibration, fundamentally changing the field's value proposition.
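The contrast can be made concrete on synthetic data in which each person has their own mood–sensing coupling (the slopes and column names below are illustrative assumptions): person-mean-centering and pooling collapses R² toward zero even though a minority of individuals track well.

```python
import numpy as np
import pandas as pd

# Synthetic EMA panel: person-specific coupling between mood and one
# concurrent sensing feature, ~15 weeks per person.
rng = np.random.default_rng(1)
rows = []
for pid in range(200):
    slope = rng.normal(0, 0.5)                    # person-specific coupling
    sensing = rng.normal(size=15)
    mood = slope * sensing + rng.normal(size=15)
    rows.append(pd.DataFrame({"participant": pid,
                              "sensing_feature": sensing, "mood": mood}))
ema = pd.concat(rows, ignore_index=True)

# Nomothetic view: person-mean-center both variables, pool, and correlate.
centered = ema.copy()
for col in ["mood", "sensing_feature"]:
    centered[col] -= centered.groupby("participant")[col].transform("mean")
pooled_r = centered["mood"].corr(centered["sensing_feature"])
print(f"pooled within-person R^2 = {pooled_r ** 2:.3f}")      # near zero

# Idiographic view: correlate within each person, then summarize.
per_r = ema.groupby("participant").apply(
    lambda g: g["mood"].corr(g["sensing_feature"]))
print(f"mean per-person |r| = {per_r.abs().mean():.2f}")
print(f"share with r^2 > 0.3: {(per_r ** 2 > 0.3).mean():.0%}")
```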
Methodology
Pipeline
- Feature extraction across 13 modalities for Study 1 (87 features); Fitbit + communication for Study 2 (28); Fitbit + phone + GPS for Study 3 (19 curated + 2,597 RAPIDS)
- 4 ML algorithms in parallel: Elastic Net, Ridge, Random Forest, SVR; plus MLP with Optuna, 1D-CNN, MOMENT foundation model, and stacking ensembles
- 5-fold cross-validation with nested CV for hyperparameter tuning (sketched after this list); FDR correction across all tests
- SHAP analysis for cross-model feature importance; LPA, mediation, and PLS-SEM for Study 1 supplementary analyses
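A minimal sketch of the nested-CV comparison over the four core learners, with a synthetic matrix in place of the real features; the hyperparameter grids are illustrative assumptions, not the project's actual search spaces:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: 300 participants x 28 features (the S2 sensing count).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 28))
y = 0.3 * X[:, 0] + rng.normal(size=300)

models = {
    "elastic_net": (make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000)),
                    {"elasticnet__alpha": [0.01, 0.1, 1.0]}),
    "ridge": (make_pipeline(StandardScaler(), Ridge()),
              {"ridge__alpha": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestRegressor(random_state=0),
                      {"n_estimators": [100, 300]}),
    "svr": (make_pipeline(StandardScaler(), SVR()),
            {"svr__C": [0.1, 1.0, 10.0]}),
}

outer = KFold(n_splits=5, shuffle=True, random_state=0)
inner = KFold(n_splits=5, shuffle=True, random_state=1)

for name, (est, grid) in models.items():
    # Inner CV tunes hyperparameters; outer CV scores the tuned model,
    # so the reported R^2 is never contaminated by the tuning search.
    tuned = GridSearchCV(est, grid, cv=inner, scoring="r2")
    scores = cross_val_score(tuned, X, y, cv=outer, scoring="r2")
    print(f"{name:>13}: nested-CV R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```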
Robustness
44 supplementary checks covering reliability, ablation, RAPIDS comparison, idiographic prediction, missingness-as-signal, dose–response (7–92 days), within-person tracking, prospective change, cross-study transfer, residualized prediction, reverse prediction, demographic controls, COVID exclusion, and more.
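As one concrete example, the dose–response check can be sketched as a truncation loop: aggregate only the first d days of sensing, refit, and trace cross-validated R² as the window grows from 7 to 92 days. The daily schema below is an assumption, and the data are synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic daily sensing for 150 participants over 92 days.
rng = np.random.default_rng(0)
n, days = 150, 92
daily = pd.DataFrame({
    "participant": np.repeat(np.arange(n), days),
    "study_day":   np.tile(np.arange(1, days + 1), n),
    "steps":       rng.normal(size=n * days),
    "sleep_dur":   rng.normal(size=n * days),
})
# End-of-term outcome, weakly tied to the full-window activity level.
full_mean = daily.groupby("participant")["steps"].mean()
outcome = (3.0 * full_mean + rng.normal(size=n)).rename("phq9")

for window in [7, 14, 30, 60, 92]:
    # Keep only the first `window` days, re-aggregate, refit, re-score.
    feats = (daily[daily["study_day"] <= window]
             .drop(columns="study_day")
             .groupby("participant").mean())
    r2 = cross_val_score(RandomForestRegressor(random_state=0),
                         feats, outcome, cv=5, scoring="r2").mean()
    print(f"{window:>2} days of sensing -> CV R2 = {r2:.3f}")
```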
Reproducibility
Three universities, four cohorts (GLOBEM INS-W_1 through INS-W_4), two BFI variants, seven mental health instruments — convergent findings across every dimension tested.
Practical Implication
Screen with a brief questionnaire (2–5 items, 10 seconds–1 minute); deploy sensing only for high-risk individuals where personalized monitoring may add idiographic value.
This inverts the dominant deployment model in digital mental health, which treats sensing as the primary signal and questionnaires as ground truth.
Technical Stack
| Component | Technology |
|---|---|
| Language | Python 3.11+ |
| ML | scikit-learn, XGBoost, statsmodels |
| Deep Learning | PyTorch 2.1+ (MPS), 1D-CNN, MOMENT foundation model (momentfm) |
| Interpretability | SHAP, permutation importance, PLS-SEM, LPA (Gaussian mixture) |
| Statistics | FDR correction, bootstrap mediation, random-effects meta-analysis |
| Data | StudentLife, NetHealth, GLOBEM (4 cohorts) |
Project Structure
Five-layer script organization: shared utilities → data preparation → core analyses → robustness checks (44 supplementary analyses) → publication materials. Seven-chapter report covering each study individually, cross-study synthesis, clinical utility, and grand synthesis.
Limitations
- All three datasets are college-age cohorts; generalization to clinical or older populations is untested
- “Personality wins” reflects between-person prediction at the population level; idiographic deployment remains an open opportunity
- Sensing modalities are heterogeneous across studies (different devices, sampling rates, derived features)
- The 28-person Study 1 result for PHQ-9 (sensing R² = 0.468) does not replicate at scale, a cautionary tale of small-N overfitting
Supervision
Supervisor: Asst Prof Cyrus Ho Su Hui, Department of Psychological Medicine, Yong Loo Lin School of Medicine, National University of Singapore. Assistant Dean (Student Life and Wellbeing), NUS Graduate School; Senior Consultant Psychiatrist, NUH; Clarivate Highly Cited Researcher (2021–2023).
GitHub Repository
sensing-vs-questionnaires-analysis
Timeline
Duration: November 2025 – April 2026 (Independent research project). Manuscript submitted to IEEE JBHI; currently under review.