PrimeKG-RGCN Drug-Disease Link Prediction
Graph neural networks for computational drug discovery
Overview
Relational Graph Convolutional Networks (R-GCN) for link prediction on PrimeKG biomedical knowledge graph (30,926 nodes, 849,456 edges). Achieves 0.9781 AUC-ROC for predicting drug-disease associations, enabling computational drug repurposing and discovery.
Problem Statement
Drug discovery is time-consuming and expensive. Knowledge graph-based link prediction can identify novel drug-disease associations by analyzing relationships between drugs, diseases, genes, and proteins, accelerating drug repurposing and precision medicine.
Methodology
PrimeKG Knowledge Graph
- Nodes: 30,926 (6,282 drugs, 5,593 diseases, 19,051 genes/proteins)
- Edges: 849,456 across 3 relation types (drug-gene, gene-gene, gene-disease)
- Sources: 20+ biomedical databases (DrugBank, OMIM, UniProt, Reactome)
R-GCN Architecture
- Input: PrimeKG subgraph
- Layers: 2 R-GCN layers (relation-specific convolutions, dropout 0.3, ReLU)
- Embeddings: 128-dim entity vectors
- Decoder: DistMult bilinear scoring function
- Training: PyTorch Geometric, NVIDIA GTX 1070, 1024-edge batches, 100 epochs
Results
Baseline Performance
Link Prediction Performance:
-
AUC-ROC: 0.9696 AUC-PR: 0.9663 F1-Score: 0.9526 -
Hits@10: 49.1% Hits@50: 15.5% MRR: 0.8027 - Mean Rank: 493.53
Analysis: Strong binary classification; ranking metrics showed potential for optimization
Phase 1 Optimization (November 2025)
Architecture Enhancements:
- LayerNorm integration after R-GCN convolutions for training stability
- Skip connections between convolutional layers for better gradient flow
- Embedding caching to eliminate redundant computations
- Vectorized batch operations replacing iterative processing
Performance Improvements:
- Classification: AUC-ROC improved by 2.98% (0.9696 → 0.9985)
- Ranking: MRR increased by 19.73% (0.8027 → 0.9611), Hits@10 up 11.89% (49.1% → 61.0%)
- Efficiency: Mean Rank decreased 88.10% (493.53 → 58.75)
- Speed: Evaluation accelerated 75×, from ~300s to 4s
Trade-offs: Precision marginally decreased by 1.07%, but recall improved by 6.94%, resulting in superior overall F1-scores
Applications
- Drug repurposing for new indications
- Disease mechanism identification
- Drug-target interaction prediction
- Experimental candidate prioritization
- Precision medicine research
Limitations & Future Work
Current Limitations:
- Phase 1: ~50.9% of true edges still fall outside top 10 predictions
- GTX 1070 memory constraints (8GB VRAM)
- Limited to 3 relation types
Phase 2 Directions:
- Further ranking optimization for top-k predictions
- Multi-hop pathway analysis
- Biological plausibility scoring
- Clinical validation of predictions
Achievements & Recognition
Key Metrics
- Phase 1: 0.9985 AUC-ROC for drug-disease link prediction (post-optimization)
- Performance: 88% reduction in mean rank, 75× evaluation speedup
- Complete pipeline: preprocessing → training → evaluation → optimization
- GPU-accelerated batch processing (1024 edges/batch)
Technical Stack
PyTorch Geometric, NetworkX, scikit-learn, pandas, matplotlib
GitHub Repository
PrimeKG-RGCN-LinkPrediction | Phase 1 Comparison Results
Timeline
Duration: October 25, 2025 - Present (Independent research project) Phase 1 Optimization: November 2025 - December 2025