Multi-Omics Integration with Autoencoders and Graph Neural Networks
Multi-evidence framework for biosynthetic pathway gene discovery via dual autoencoders and Graph Attention Networks
Overview
Computational framework for identifying biosynthetic pathway genes by integrating transcriptomic and metabolomic data through deep learning. The system triangulates three complementary kinds of evidence (statistical correlation, model-derived feature importance from the autoencoders and the GNN, and sequence homology) into a unified multi-evidence ranking to reduce false positives in candidate gene discovery.
Key Innovation: GNN classification accuracy is used as a check that autoencoder-derived importance scores reflect genuine biological signal rather than noise.
Problem Statement
Discovering genes responsible for metabolite biosynthesis is challenging due to:
- Ultra-high dimensionality: Transcriptomic (100K+ features) and metabolomic (thousands of features) profiles have far more features than samples
- Small sample regime: Typical cohorts have n=20-100 samples due to cost constraints
- False positive risk: Single-method approaches are not cross-checked against other evidence types
- Nonlinear relationships: Gene-metabolite associations involve complex regulatory cascades
Pipeline Architecture (5 Stages)
Stage 1 — Data Preparation
- Log2 transformation of FPKM values + z-score standardization
- Metabolite intensity normalization
- Quality control visualizations and sample metadata generation
Stage 2 — Model Training
- Independent gene and metabolite autoencoder training → 64-dimensional latent representations
- Graph Attention Network (GAT) training on concatenated latent vectors with auxiliary tissue/geographic classification tasks
Stage 3 — Analysis & Visualization
- Dimensionality reduction: t-SNE, UMAP, PCA on latent embeddings
- Gene-metabolite correlation matrices across all sample conditions
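The correlation step pairs naturally with the Spearman + FDR entry in the Technical Stack. The sketch below is one way such a matrix could be computed, under stated assumptions: the helper names are hypothetical, p-values use the usual t approximation, and Benjamini-Hochberg is assumed as the FDR procedure (the document does not name one).

```python
import numpy as np
from scipy import stats

def spearman_matrix(genes, metabolites):
    """Spearman rho between every gene column and every metabolite column.

    genes: (n_samples, n_genes); metabolites: (n_samples, n_metabolites).
    Constant features have zero rank variance and would yield NaN.
    """
    n = genes.shape[0]
    # Rank within each feature (ties get average ranks), then z-score the ranks.
    gr = stats.rankdata(genes, axis=0)
    mr = stats.rankdata(metabolites, axis=0)
    gr = (gr - gr.mean(axis=0)) / gr.std(axis=0)
    mr = (mr - mr.mean(axis=0)) / mr.std(axis=0)
    # Pearson correlation of ranks equals Spearman rho.
    return gr.T @ mr / n

def spearman_pvalues(rho, n):
    """Two-sided p-values from the t approximation with n - 2 degrees of freedom."""
    r = np.clip(rho, -0.999999, 0.999999)  # avoid division by zero at |rho| = 1
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(np.abs(t), df=n - 2)

def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the same shape as the input."""
    p = np.asarray(pvals, dtype=float)
    flat = p.ravel()
    m = flat.size
    order = np.argsort(flat)
    scaled = flat[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q.reshape(p.shape)
```

With a small cohort, the adjusted values will be conservative across a full gene-by-metabolite matrix, which is one reason to combine correlation with other evidence instead of thresholding it alone.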
Stage 4 — Feature Importance Extraction
Three complementary importance metrics:
- Reconstruction importance: Per-gene MSE contribution from autoencoder
- Gradient-based importance: Backpropagation through encoder layers
- GNN-to-gene mapping: Projects latent-space importance back to individual genes via correlation weights
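Two of these metrics can be illustrated without a trained model. The sketch below is an interpretation, not the project's implementation: it assumes reconstruction importance starts from the per-gene mean squared error, and that "correlation weights" means the absolute Pearson correlation between each gene and each latent dimension. Both function names are hypothetical. Gradient-based importance needs autograd and is not shown.

```python
import numpy as np

def reconstruction_error_per_gene(x, x_hat):
    """Per-gene mean squared reconstruction error, shape (n_genes,)."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float)
    return (diff ** 2).mean(axis=0)

def map_latent_importance_to_genes(x, z, latent_importance):
    """Spread latent-dimension importance over genes via |Pearson r| weights.

    x: (n_samples, n_genes) standardized expression
    z: (n_samples, n_latent) latent embeddings
    latent_importance: (n_latent,) non-negative scores, e.g. from the GNN
    """
    xc = x - x.mean(axis=0)
    zc = z - z.mean(axis=0)
    # Unit-normalize centered columns so the dot product is Pearson r.
    xs = xc / (np.linalg.norm(xc, axis=0) + 1e-12)
    zs = zc / (np.linalg.norm(zc, axis=0) + 1e-12)
    corr = xs.T @ zs  # (n_genes, n_latent)
    return np.abs(corr) @ np.asarray(latent_importance, dtype=float)
```

How the raw per-gene error is converted into a ranking (for example, whether well-reconstructed or poorly reconstructed genes score higher) is a design decision the sketch leaves open.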
Stage 5 — Multi-Evidence Ranking
Weighted combination of rank-normalized evidence scores (weights sum to 1.0):
- 0.4 × Correlation score
- 0.3 × Autoencoder importance
- 0.2 × GNN importance
- 0.1 × BLAST/HMM homology bonus
Produces ranked gene lists with supporting documentation per gene family.
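A minimal sketch of this fusion step follows, using the four weights listed above. Several details are assumptions made for illustration: ranks are scaled to (0, 1] with the strongest evidence at 1.0, correlation strength is taken as the absolute value, and the homology bonus is a 0/1 flag per gene. The function and dictionary names are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

WEIGHTS = {"correlation": 0.4, "autoencoder": 0.3, "gnn": 0.2, "homology": 0.1}

def rank_normalize(scores):
    """Map raw scores to (0, 1]; the highest score gets 1.0, ties share a rank."""
    scores = np.asarray(scores, dtype=float)
    return rankdata(scores, method="average") / scores.size

def multi_evidence_score(correlation, autoencoder, gnn, homology_hit, weights=WEIGHTS):
    """Weighted sum of rank-normalized evidence plus a homology bonus."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("evidence weights should sum to 1")
    return (
        weights["correlation"] * rank_normalize(np.abs(correlation))
        + weights["autoencoder"] * rank_normalize(autoencoder)
        + weights["gnn"] * rank_normalize(gnn)
        + weights["homology"] * np.asarray(homology_hit, dtype=float)
    )

# Toy usage with four candidate genes.
toy_scores = multi_evidence_score(
    correlation=[0.9, -0.2, 0.5, 0.1],
    autoencoder=[3.0, 1.0, 2.0, 0.5],
    gnn=[0.7, 0.1, 0.9, 0.3],
    homology_hit=[1, 0, 0, 0],
)
toy_order = np.argsort(-toy_scores)  # best candidate first
```

Because each evidence type is converted to ranks before weighting, a score measured on a large numeric scale cannot dominate one measured on a small scale.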
Methodology
Dual Autoencoder Architecture
Gene Autoencoder (545M parameters):
- Encoder: 132,129 → 8,192 → 1,024 → 128 → 64
- Decoder: 64 → 128 → 1,024 → 8,192 → 132,129
- Regularization: Dropout (0.3), Early Stopping (patience=10), 5-fold CV
Metabolite Autoencoder (30.8M parameters):
- Encoder: 6,980 → 1,024 → 256 → 64
- Decoder: 64 → 256 → 1,024 → 6,980
- Regularization: Dropout (0.3), Early Stopping (patience=10), 5-fold CV
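Parameter totals can be checked directly from the layer widths. The arithmetic below assumes plain fully connected layers with biases and no weight tying between encoder and decoder; the layer types are not stated, so this is an assumption. Under it, the widths listed above give roughly 2.18 billion parameters for the gene autoencoder and 14.9 million for the metabolite autoencoder. These differ from the quoted 545M and 30.8M, so the quoted totals may correspond to a different layer configuration or counting convention.

```python
def dense_param_count(widths):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))

def autoencoder_param_count(encoder_widths):
    """Total for an encoder and its mirror-image decoder."""
    decoder_widths = encoder_widths[::-1]
    return dense_param_count(encoder_widths) + dense_param_count(decoder_widths)

gene_total = autoencoder_param_count([132_129, 8_192, 1_024, 128, 64])
metabolite_total = autoencoder_param_count([6_980, 1_024, 256, 64])
print(f"gene AE:       {gene_total / 1e6:,.1f}M parameters")
print(f"metabolite AE: {metabolite_total / 1e6:,.1f}M parameters")
```

Almost all of the weight sits in the first encoder layer and the last decoder layer, so the width of the first hidden layer largely determines model size and memory use.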
Graph Attention Network
- Operates on concatenated 128-dim latent vectors (64 gene + 64 metabolite)
- Multi-task classification for tissue/geographic metadata
- Validation logic: High classification accuracy → latent representations capture real biological variation → AE importance scores derived from those representations are more likely to be biologically meaningful
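For readers unfamiliar with graph attention, the forward pass of a single attention head can be written in NumPy as below. This follows the standard GAT formulation and is not the project's model: how the sample graph is built is not described here, so the adjacency matrix is a hypothetical input, and the multi-task classification heads, multi-head attention, and output nonlinearity are omitted.

```python
import numpy as np

def gat_layer(h, adj, W, a, negative_slope=0.2):
    """Single-head graph attention layer (forward pass only).

    h:   (n_nodes, f_in) node features, e.g. 128-dim concatenated latents
    adj: (n_nodes, n_nodes) boolean adjacency that includes self-loops
    W:   (f_in, f_out) shared linear projection
    a:   (2 * f_out,) attention vector
    Returns the updated node features and the attention matrix.
    """
    z = h @ W
    f_out = z.shape[1]
    # e[i, j] = LeakyReLU(a^T [z_i || z_j]), split into source and target parts.
    src = z @ a[:f_out]
    dst = z @ a[f_out:]
    e = src[:, None] + dst[None, :]
    e = np.where(e > 0, e, negative_slope * e)
    e = np.where(adj, e, -np.inf)            # mask non-neighbours
    e = e - e.max(axis=1, keepdims=True)     # stabilise the softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ z, alpha
```

Each row of the attention matrix sums to 1 over a node's neighbours, so the learned weights indicate which neighbouring samples contribute most to a node's updated representation.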
Multi-Evidence Scoring
- Rank-based normalization ensures fair contribution from evidence types at different scales
- Latent-correlation mapping preserves interpretability back to biological features
- Gene family integration supports post-hoc BLAST/HMM result incorporation
Key Features
- Modular pipeline: Run individual stages or complete workflows
- Hardware abstraction: Automatic detection of MPS (Apple Silicon), CUDA, or CPU
- Gene family integration: Post-hoc BLAST/HMM incorporation with batch processing
- Configurable weights: Customizable evidence combination via YAML config
- Checkpoint system: Intermediate model state preservation
- Publication-quality output: 300 DPI figures, CSV tables, Excel-formatted rankings
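Hardware detection of the kind listed above is commonly written as a short helper around PyTorch's availability checks. The version below is a sketch: the function name `pick_device`, the CUDA-before-MPS priority, and the CPU fallback when PyTorch is not installed are assumptions, not details taken from the repository.

```python
def pick_device():
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can use."""
    try:
        import torch
    except ImportError:
        return "cpu"  # fallback for this sketch when PyTorch is not installed
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `model.to(device)` or `torch.device(device)` when building the training loop.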
Results
Model Performance
- Gene AE captures major transcriptomic variation despite 6,619:1 feature-to-sample ratio
- Metabolite AE compresses metabolomic profiles into a meaningful 64-dimensional latent space
- GAT classification validates biological signal in learned representations
Key Findings
- Large-scale autoencoders handle extreme feature-to-sample ratios (6,619:1) through extensive regularization
- Multi-evidence triangulation reduces false positives compared to single-method approaches
- GNN validation confirms that learned representations capture genuine biological structure
- Rank-based fusion fairly combines heterogeneous evidence types
Technical Stack
| Component | Technology |
|---|---|
| Language | Python 3.8+ |
| Deep Learning | PyTorch 2.0+ |
| Graph Networks | Graph Attention Network (GAT) |
| Hardware | MPS (Apple Silicon) / CUDA / CPU |
| Configuration | YAML-based pipeline management |
| Analysis | scikit-learn, scipy.stats (Spearman + FDR) |
| Visualization | t-SNE, UMAP, PCA (publication-quality) |
Privacy & Ethics
- No biological identifiers: Sample names, patient IDs removed
- Feature anonymization: Specific gene/metabolite names withheld in public repository
- Aggregated reporting: Only statistical summaries and framework architecture shared
- Collaborative approval: Data provider explicitly approved open-source release
Limitations & Future Work
Current Limitations
- Small sample size (n=21): Limits generalizability; recommend n≥100 for production
- Specific biological context: Associations may not generalize across tissues/conditions
- Computational cost: 545M-parameter models require GPU acceleration
Future Directions
- Scaling to larger cohorts (n≥100)
- Transfer learning with public datasets (TCGA, GTEx)
- Attention mechanisms for relation-specific interpretability
- Multi-modal extension (proteomics, epigenomics, clinical variables)
- Federated learning for multi-institution collaboration
Code Availability
GitHub Repository: omics-ae-gnn
Note: Specific biological data not included to protect collaborator privacy. Framework designed for generalizability across datasets.
Project Status: Ongoing (December 2025 - Present)
Technologies: PyTorch, GAT, Python
Category: AI for Health, Computational Biology, Biosynthetic Pathway Discovery