PrimeKG GNN Drug-Disease Link Prediction
Multi-architecture GNN benchmark for computational drug repurposing on biomedical knowledge graphs
Overview
Comprehensive biomedical link prediction framework comparing 6 GNN architectures (RGCN, GCN, GAT, GraphSAGE, GIN, MLP) on the PrimeKG knowledge graph for predicting drug-disease therapeutic indications. Version 2.0 features a critical data leakage fix (71.5% → 0% leakage) with undirected-edge-aware splitting and hard negative evaluation. Best model (GAT) achieves 0.9866 AUC-ROC under strict evaluation.
Key Finding: Attention mechanisms (GAT) outperform explicit relation-type modeling (RGCN) for drug-disease link prediction.
Problem Statement
Drug discovery is time-consuming and expensive. Knowledge graph-based link prediction can identify novel drug-disease associations by analyzing multi-hop relationships between drugs, diseases, genes, and proteins, accelerating drug repurposing and precision medicine.
Methodology
PrimeKG Knowledge Graph
- Source: 4.5M relationships from 20+ biomedical databases (DrugBank, OMIM, UniProt, Reactome)
- Processed Graph: 30,926 nodes (6,282 drugs, 5,593 diseases, 19,051 genes/proteins)
- Edges: 849,456 across 3 relation types (drug-gene, gene-gene, gene-disease)
- Split: 70% train (838,882) / 15% val (7,688) / 15% test (7,708)
Model Architectures
- GAT: Graph Attention Network — learned attention weights
- RGCN: Relational GCN — relation-specific convolutions
- GIN: Graph Isomorphism Network — maximally expressive message passing
- GraphSAGE: Inductive sampling-based aggregation
- GCN: Standard graph convolution baseline
- MLP: Non-graph baseline
Shared Configuration: 128-dim hidden, 64-dim embeddings, DistMult decoder, dropout 0.5, 50 epochs, batch size 2048
Data Integrity
- Version 2.0: Undirected-edge-aware splitting eliminates data leakage (71.5% → 0%)
- Hard Negative Sampling: 50 negatives per positive for strict evaluation
Results
Strict Evaluation (Hard Negatives, 50 per positive)
| Model | AUC-ROC | AP | Hits@10 | MRR |
|---|---|---|---|---|
| GAT | 0.9866 | 0.7955 | 0.9831 | 0.9031 |
| RGCN | 0.9794 | 0.6697 | 0.9789 | 0.8699 |
| GIN | 0.9774 | 0.6522 | 0.9757 | 0.8613 |
| GraphSAGE | 0.9742 | 0.6394 | 0.9789 | 0.8862 |
| GCN | 0.9657 | 0.5538 | 0.9724 | 0.8640 |
| MLP | 0.6592 | 0.0314 | 0.5234 | 0.2710 |
Key Insights
- Graph structure is critical: All GNNs vastly outperform MLP baseline
- Attention > Relation modeling: GAT outperforms RGCN across all metrics
- Principled evaluation matters: Data leakage fix revealed true model capabilities masked by loose protocols
Advanced Analyses
- Path-based explanations with NLP-generated summaries for interpretability
- Disease-specific case studies with drug repurposing predictions
- Embedding visualization (t-SNE/UMAP) with clustering analysis
- Error pattern analysis and failure mode characterization
- Biological plausibility validation with evidence gathering
- Baseline comparisons (random, degree-based, TransE)
- Confidence calibration and performance breakdowns by node type
Technical Stack
| Component | Technology |
|---|---|
| Deep Learning | PyTorch ≥2.0 |
| Graph Networks | PyTorch Geometric ≥2.3 |
| Graph Algorithms | NetworkX ≥2.8 |
| Visualization | Matplotlib, Seaborn, Plotly, UMAP |
| Metrics | scikit-learn ≥1.3 |
| Hardware | CUDA 11.0+ / Apple MPS / CPU |
Project Structure
src/
├── run_full_analysis.py # Main orchestrator
├── train.py # Multi-model training
├── strict_evaluation.py # Hard negative sampling
├── case_studies.py # Disease-specific predictions
├── explain_predictions.py # Path-based reasoning
├── medical_validation.py # Biological plausibility
├── visualize_embeddings.py # t-SNE/UMAP
├── error_analysis.py # Error patterns
├── compare_methods.py # Baseline comparison
└── models/ # Modular encoder registry
├── rgcn.py, gcn.py, gat.py, graphsage.py, gin.py, mlp.py
Applications
- Drug repurposing for new therapeutic indications
- Disease mechanism identification via multi-hop reasoning
- Drug-target interaction prediction
- Experimental candidate prioritization
- Precision medicine research
Limitations & Future Work
Current Limitations:
- Strict evaluation reveals ranking challenges masked by loose protocols
- Limited to 3 relation types from PrimeKG subset
- Case study validation limited to literature evidence
Future Directions:
- Full PrimeKG integration (all relation types)
- Temporal knowledge graph evolution
- Clinical trial outcome integration
- External validation against DrugBank updates
GitHub Repository
Timeline
Duration: October 2025 - Present (Independent research project)
- Phase 1: Single RGCN model with optimization (Oct-Nov 2025)
- Phase 2: Multi-architecture comparison, data leakage fix, strict evaluation (2026)