LitScribe — AI Literature Review with Citation Grounding & Contradiction Detection
A deterministic 10-step pipeline with a DeepAgents supervisor that produces verified, contradiction-aware literature reviews from a research question in about two minutes.
Overview
LitScribe is an AI literature review engine that takes a research question and returns a 1,500–2,000 word review with verified citations, methodology comparison, research timeline, and BibTeX export — typically in about two minutes. Unlike chat-based tools that hallucinate citations, every [@key] reference is verified against the source paper. Unlike summarizers, the system detects contradictions across papers and presents them as critical analysis.
What Makes It Different
Three contributions not present in any existing tool (ChatGPT, Elicit, PaperQA2, SurveyG):
- Contradiction-aware synthesis — existing tools either generate reviews (ignoring contradictions) or detect contradictions (without writing reviews). LitScribe does both end-to-end: detect → inject into narrative → present as critical analysis.
- Metacognitive quality loop — after self-review, the system evaluates which pipeline steps failed and autonomously re-runs them with adjusted strategy. Strategy decisions are persisted to a knowledge base for future reviews.
- Search-augmented refinement — when users say “add a section about X”, LitScribe searches for new papers on X first, analyzes them, then writes the section with real citations from the newly found papers — never hallucinating.
Architecture
DeepAgents Supervisor + Deterministic Pipeline
A natural-language supervisor (built on DeepAgents, 7 tools) interprets user intent, then dispatches a deterministic 10-step pipeline with a metacognitive quality loop:
```mermaid
sequenceDiagram
    actor User
    participant S as Supervisor<br/>(DeepAgents)
    participant P as Pipeline
    User->>S: "Review CRISPR knockout strategies in CHO cells"
    S->>S: Understand intent + confirm
    S->>P: run_review(question, 12 papers)
    rect rgb(240, 248, 255)
        Note over P: Deterministic Pipeline
        P->>P: 1. Plan (decompose sub-topics)
        P->>P: 2. Search (6 academic databases)
        P->>P: 3. Unpaywall (OA full-text lookup)
        P->>P: 4. Read (analyze each paper)
        P->>P: 5. Contradictions (pairwise)
        P->>P: 6. GraphRAG (knowledge graph)
        P->>P: 7. Synthesize (parallel sections)
        P->>P: 8. Debate (reviewer ↔ synthesizer)
        P->>P: 9. Ground (verify citations)
        P->>P: 10. Self-review (quality score)
    end
    alt Score < 0.65
        P->>P: Metacognition → re-run failed steps
    end
    P->>P: Save (session + knowledge base)
    P-->>S: Review + References + BibTeX
    S-->>User: 1,500-word review, score 0.82
```
Each numbered step is independently testable and persists to a SQLite knowledge base, so reviews can resume after interruption and downstream steps reuse upstream results across sessions.
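The persistence contract described above can be sketched as a minimal resumable pipeline. The step names match the diagram, but the SQLite schema, session handling, and step callables are illustrative stand-ins, not LitScribe's actual code:

```python
import json
import sqlite3

# Step names from the pipeline diagram.
STEPS = ["plan", "search", "unpaywall", "read", "contradictions",
         "graphrag", "synthesize", "debate", "ground", "self_review"]

def run_pipeline(db: sqlite3.Connection, session: str, steps: dict) -> dict:
    """Run each step, persisting its output; completed steps are reused."""
    db.execute("CREATE TABLE IF NOT EXISTS results "
               "(session TEXT, step TEXT, output TEXT, "
               "PRIMARY KEY (session, step))")
    outputs = {}
    for name in STEPS:
        row = db.execute("SELECT output FROM results WHERE session=? AND step=?",
                         (session, name)).fetchone()
        if row:  # step finished in a previous (possibly interrupted) run
            outputs[name] = json.loads(row[0])
            continue
        outputs[name] = steps[name](outputs)  # run the step on upstream results
        db.execute("INSERT INTO results VALUES (?,?,?)",
                   (session, name, json.dumps(outputs[name])))
        db.commit()  # persist immediately so a crash loses at most one step
    return outputs
```

Because each step's output is written before the next step starts, re-running the same session skips everything already done.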
Six-Source Academic Search
The Search step (step 2 in the pipeline) fans out across six databases in parallel, then funnels results through dedup → keyword filter → LLM selection → open-access PDF lookup:
```mermaid
graph TB
    Q[Research Question] --> PLAN[LLM Plan<br/>sub-topics + queries]
    PLAN --> |parallel| OA[OpenAlex<br/>250M+]
    PLAN --> |parallel| EPM[Europe PMC<br/>40M+]
    PLAN --> |parallel| PM[PubMed<br/>35M+]
    PLAN --> |parallel| S2[Semantic Scholar<br/>200M+]
    PLAN --> |parallel| AR[arXiv<br/>2M+]
    PLAN --> |parallel| CR[CrossRef<br/>140M+]
    OA --> DEDUP[DOI Dedup]
    EPM --> DEDUP
    PM --> DEDUP
    S2 --> DEDUP
    AR --> DEDUP
    CR --> DEDUP
    DEDUP --> KW[Keyword Filter]
    KW --> LLM[LLM Selection<br/>top N relevant]
    LLM --> UPW[Unpaywall<br/>OA PDF lookup]
    UPW --> OUT[Papers<br/>ready for analysis]
    style OA fill:#e8f5e9
    style EPM fill:#e8f5e9
    style PM fill:#e3f2fd
    style S2 fill:#e8f5e9
    style AR fill:#fff3e0
    style CR fill:#e8f5e9
```
| Source | Coverage | Domain |
|---|---|---|
| OpenAlex | 250M+ | All |
| Semantic Scholar | 200M+ | All |
| CrossRef | 140M+ | All (incl. Chinese) |
| Europe PMC | 40M+ | Life science |
| PubMed | 35M+ | Biomedical |
| arXiv | 2M+ | CS/Physics/Math |
Domain-aware routing auto-skips arXiv for biology/medicine; LLM-based selection filters irrelevant results after DOI deduplication.
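The fan-out → dedup → keyword-filter funnel can be sketched as below, assuming each source exposes a simple `fetch(query)` callable. This is a minimal illustration; the real system queries six database APIs and follows the keyword filter with LLM selection and Unpaywall lookup:

```python
from concurrent.futures import ThreadPoolExecutor

def search_all(fetchers, query: str, keywords: set[str]) -> list[dict]:
    """Query every source in parallel, dedup by DOI, then keyword-filter."""
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        batches = list(pool.map(lambda f: f(query), fetchers))
    # DOI dedup: the first source to return a DOI wins.
    seen, papers = set(), []
    for batch in batches:
        for p in batch:
            doi = (p.get("doi") or "").lower()
            if doi and doi in seen:
                continue
            seen.add(doi)
            papers.append(p)
    # Cheap keyword filter before the (more expensive) LLM selection step.
    return [p for p in papers
            if any(k in p.get("title", "").lower() for k in keywords)]
```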
Features
Citation Grounding
Every [@key] is verified against the source paper’s actual findings. Unsupported claims are auto-detected and rewritten. Grounding accuracy: 62–100% across benchmark domains (see Benchmark).
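One way to sketch the grounding check, using keyword overlap between a citing sentence and the cited paper's recorded findings as a stand-in for the LLM verification the system presumably performs. `check_grounding` and its heuristic are illustrative, not LitScribe's API:

```python
import re

def check_grounding(review: str, findings: dict[str, str]) -> list[str]:
    """Return citation keys whose claims have no support in the source findings."""
    ungrounded = []
    for sentence in re.split(r"(?<=[.!?])\s+", review):
        for key in re.findall(r"\[@(\w+)\]", sentence):
            source_terms = set(re.findall(r"[a-z]{5,}", findings.get(key, "").lower()))
            claim_terms = set(re.findall(r"[a-z]{5,}", sentence.lower()))
            if not claim_terms & source_terms:  # no overlap at all: flag it
                ungrounded.append(key)
    return ungrounded
```

Flagged citations would then be routed back to the synthesizer for rewriting.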
Contradiction Detection
Pairwise comparison surfaces opposing findings, which are then injected into the review as critical commentary:
[@cho2025]: “FUT8 knockout had no negative effect on cell growth”
[@lin2020]: “FUT8 knockout reduced cell viability by 40%”
→ classified as opposing_conclusions (major) → integrated into the discussion
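The pairwise pass can be sketched as follows. The keyword-based `direction` heuristic is a toy stand-in for the LLM classification the system uses, and the function names are assumptions:

```python
from itertools import combinations

POSITIVE = ("no negative effect", "no effect", "improved", "increased")
NEGATIVE = ("reduced", "decreased", "impaired", "negative effect on")

def direction(finding: str) -> int:
    """Crude direction-of-effect detector: +1 positive/neutral, -1 negative."""
    text = finding.lower()
    if any(p in text for p in POSITIVE):
        return +1
    if any(n in text for n in NEGATIVE):
        return -1
    return 0

def find_contradictions(findings: dict[str, str]) -> list[tuple[str, str]]:
    """Compare every pair of papers; opposing directions are candidates."""
    pairs = []
    for (a, fa), (b, fb) in combinations(findings.items(), 2):
        if direction(fa) * direction(fb) == -1:
            pairs.append((a, b))  # opposing_conclusions candidate
    return pairs
```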
Multi-Agent Debate
After synthesis, a reviewer agent critiques the draft → synthesizer revises → up to 2 rounds. Catches unsupported claims, missing perspectives, and weak synthesis.
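A minimal sketch of this bounded critique/revise loop, with the reviewer and synthesizer agents abstracted as callables (LLM calls in the real system):

```python
MAX_ROUNDS = 2  # the debate is capped at two rounds

def debate(draft: str, reviewer, synthesizer) -> str:
    """Reviewer critiques the draft; synthesizer revises; stop early if clean."""
    for _ in range(MAX_ROUNDS):
        critiques = reviewer(draft)            # list of issues, empty if clean
        if not critiques:
            break                              # nothing left to fix
        draft = synthesizer(draft, critiques)  # revise to address the issues
    return draft
```

The early exit matters: a clean first draft costs one reviewer call, not two full rounds.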
Local Paper Modes
- `litscribe draft my_review.md paper1.pdf paper2.pdf` — analyze your draft against references
- `litscribe outline *.pdf refs.bib` — “what review could I write from these?”
- `litscribe augment "CRISPR delivery" papers/*.pdf` — your papers + online search
Natural Language Interface
The supervisor understands intent in English or Chinese and routes to the right tool:
"写个关于transformer的综述,3个主题" ("Write a review on transformers, 3 themes") → run_review (Chinese)
"加一段关于delivery methods" ("Add a section on delivery methods") → refine_review (searches new papers first)
"这篇综述适合什么人看" ("Who is this review written for?") → assess_reading_level
Cross-Lingual
Direct generation in target language (not post-translation), CJK-aware word counting, mismatch detection.
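CJK-aware counting might look like the following rough heuristic: each CJK ideograph counts as one word, while a run of alphanumeric characters counts as one. This is an assumed sketch of the idea, not LitScribe's exact rules:

```python
import unicodedata

def is_cjk(ch: str) -> bool:
    """CJK ideographs have 'CJK' in their Unicode character name."""
    return "CJK" in unicodedata.name(ch, "")

def word_count(text: str) -> int:
    count, in_word = 0, False
    for ch in text:
        if is_cjk(ch):
            count += 1        # each ideograph is its own word
            in_word = False
        elif ch.isalnum():
            if not in_word:   # start of a Latin/numeric word
                count += 1
            in_word = True
        else:
            in_word = False   # whitespace/punctuation ends the word
    return count
```

Naive whitespace splitting would count a 2,000-character Chinese review as a handful of "words", so a rule like this is needed to enforce the 1,500–2,000 word target cross-lingually.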
Security
HMAC integrity on the knowledge base to detect data poisoning, prompt injection filter, XSS prevention, rate limiting, PDF URL domain whitelist, parameterized SQL.
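The HMAC integrity check can be sketched with Python's standard `hmac` module; key management and record serialization here are assumptions:

```python
import hashlib
import hmac

def sign(key: bytes, record: bytes) -> str:
    """Tag a knowledge-base record at write time with HMAC-SHA256."""
    return hmac.new(key, record, hashlib.sha256).hexdigest()

def verify(key: bytes, record: bytes, tag: str) -> bool:
    """Recompute the tag at read time; constant-time compare avoids timing leaks."""
    return hmac.compare_digest(sign(key, record), tag)
```

Any record modified without knowledge of the server-side key fails verification, which is how poisoned knowledge-base entries are detected.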
Review Output
Every generated review includes:
- Thematic analysis with `[@key]` Pandoc citations
- BibTeX-compatible reference list
- Methodology Comparison table across all papers
- Research Timeline (foundation → development → frontier)
- Statistical Summary — extracted p-values, effect sizes, sample sizes
- Suggested Figures — recommended visualizations with placement notes
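BibTeX export for the reference list might be sketched like this; the `@article` entry type and field set are minimal assumptions, not the tool's exact output format:

```python
def to_bibtex(key: str, paper: dict) -> str:
    """Render one paper's metadata as a BibTeX @article entry."""
    fields = {
        "title": paper["title"],
        "author": " and ".join(paper["authors"]),  # BibTeX author separator
        "year": str(paper["year"]),
        "doi": paper.get("doi", ""),
    }
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items() if v)
    return f"@article{{{key},\n{body}\n}}"
```

Because the citation key doubles as the in-text `[@key]` marker, the exported entries pair directly with a Pandoc citeproc workflow.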
Interfaces
- CLI — 12 commands (`chat`, `review`, `draft`, `outline`, `augment`, `sessions`, `export`, `evaluate`, `serve`, `skills`, …)
- Web UI + REST API — 16 endpoints including SSE streaming for the review pipeline
- Sessions + Knowledge Base — persistent across reviews; supports diff-based refinement
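The SSE stream for pipeline progress can be sketched as plain event framing: each progress event becomes one `data:` frame terminated by a blank line, per the SSE wire format. The payload shape below is an assumption:

```python
import json

def sse_frame(event: dict) -> str:
    """Serialize one event in Server-Sent Events wire format."""
    return f"data: {json.dumps(event)}\n\n"

def stream_pipeline(step_names):
    """Yield one SSE frame per completed pipeline step."""
    for i, name in enumerate(step_names, 1):
        yield sse_frame({"step": i, "name": name, "status": "done"})
```

A FastAPI endpoint would return such a generator wrapped in a streaming response with `media_type="text/event-stream"`, letting the Web UI render step-by-step progress.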
Benchmark
| Domain | Score | Papers | Words | Grounding | Time |
|---|---|---|---|---|---|
| Biology (CHO CRISPR) | 0.82 | 9 | 1,649 | 83% | 106 s |
| CS (LLM reasoning) | 0.72 | 12 | 1,535 | 63% | 121 s |
| Medicine (scRNA-seq TME) | 0.65 | 12 | 1,709 | 68% | 115 s |
| CS (transformer attention) | 0.55 | 12 | 1,643 | 62% | 153 s |
| Chemistry (sesquiterpene) | 0.55 | 3 | 1,338 | 100% | 103 s |
Self-review score 0–1; grounding = % citations verified against source paper.
Technical Stack
| Component | Technology |
|---|---|
| Language | Python 3.12+ |
| Supervisor | DeepAgents (7 tools) |
| LLM | Any OpenAI-compatible endpoint (DeepSeek, Qwen, Claude, …) |
| Knowledge Graph | NetworkX + Leiden community detection |
| Embeddings | sentence-transformers |
| PDF Processing | pymupdf4llm (default), marker-pdf (OCR) |
| Storage | SQLite (sessions, knowledge base with HMAC, vectors) |
| Backend | FastAPI + SSE streaming |
| Codebase | 86 files, 10,300 lines, 49 tests |
Use Cases
- Researchers — rapid literature surveys; contradiction-aware systematic review preparation; gap identification
- Graduate students — thesis literature chapters; state-of-the-art mapping; seminal paper discovery
- Industry R&D — technical due diligence; competitive landscape from academic publications
Limitations & Roadmap
- Requires API keys for cloud LLM functionality
- PDF extraction quality varies with document formatting
- Grounding accuracy degrades in highly technical domains (62–68% in some CS/medicine cases)
Planned: local LLM support (Ollama/MLX/vLLM); subscription system with daily digest; richer Web UI.
Publication
arXiv preprint (NeurIPS-style template, 10 pages): case study, ablation, algorithm details, discussion. Draft available in the repository.
GitHub Repository
Timeline
Status: Active development (January 2026 – present)