PrimeKG-RGCN Drug-Disease Link Prediction

Graph neural networks for computational drug discovery

Overview

Relational Graph Convolutional Networks (R-GCN) for link prediction on PrimeKG biomedical knowledge graph (30,926 nodes, 849,456 edges). Achieves 0.9781 AUC-ROC for predicting drug-disease associations, enabling computational drug repurposing and discovery.

Problem Statement

Drug discovery is time-consuming and expensive. Knowledge graph-based link prediction can identify novel drug-disease associations by analyzing relationships between drugs, diseases, genes, and proteins, accelerating drug repurposing and precision medicine.

Methodology

PrimeKG Knowledge Graph

  • Nodes: 30,926 (6,282 drugs, 5,593 diseases, 19,051 genes/proteins)
  • Edges: 849,456 across 3 relation types (drug-gene, gene-gene, gene-disease)
  • Sources: 20+ biomedical databases (DrugBank, OMIM, UniProt, Reactome)

R-GCN Architecture

  • Input: PrimeKG subgraph
  • Layers: 2 R-GCN layers (relation-specific convolutions, dropout 0.3, ReLU)
  • Embeddings: 128-dim entity vectors
  • Decoder: DistMult bilinear scoring function
  • Training: PyTorch Geometric, NVIDIA GTX 1070, 1024-edge batches, 100 epochs

Results

Link Prediction Performance:

  • AUC-ROC: 0.9781 AUC-PR: 0.9663 F1-Score: 0.9526
  • Hits@10: 4.1% Hits@50: 15.5% MRR: 0.0187

Analysis: Excellent binary classification; ranking metrics show room for improvement (hub bias, graph sparsity)

Applications

  • Drug repurposing for new indications
  • Disease mechanism identification
  • Drug-target interaction prediction
  • Experimental candidate prioritization
  • Precision medicine research

Limitations & Future Work

Limitations: GTX 1070 memory constraints (8GB VRAM), ranking metrics lower than classification metrics, limited to 3 relation types

Future Directions:

  • Multi-hop pathway analysis
  • Biological plausibility scoring
  • Clinical validation of predictions

Achievements & Recognition

Key Metrics

  • 0.9781 AUC-ROC for drug-disease link prediction
  • Complete pipeline: preprocessing → training → evaluation
  • GPU-accelerated batch processing (1024 edges/batch)

Technical Stack

PyTorch Geometric, NetworkX, scikit-learn, pandas, matplotlib

Timeline

Duration: October 25, 2025 - Present (Independent research project)