The Goal
The point is not the pipeline — it's what happens when you sit down to write. I wanted to be able to ask a question about my field and get back a real answer — grounded in my own papers, cross-referenced against the broader literature, with actual citations I could verify. Not "the literature suggests," but a synthesis I could trust enough to put in a grant.
That's what this system does. A Claude Code session can search semantically across ~1,000 full-text scientific papers, query the entire PubMed database, cross-reference findings, identify gaps in the evidence, and recommend specific papers to add. It turns grant writing from a memory exercise into a conversation with an AI that has actually read your papers.
What a Research Session Looks Like
Say I'm writing a specific aims page and I need to know: "What's the evidence that FGF21 acts in the hindbrain vs. the hypothalamus?" The system searches my vault semantically, finds the relevant papers, reads their key findings, checks PubMed for recent work I might not have, synthesizes across sources with specific citations, flags where the evidence is contested, and tells me which papers I should add to strengthen the argument.
The output has passage-level citations — not "the literature suggests" but "Laeger et al. (2014, JCI) showed that..." I use this workflow for R01 grant writing, manuscript drafting, and literature review. It's the closest thing I've found to having a collaborator who has read everything and remembers all of it.
The Infrastructure
PDFs go in, searchable knowledge comes out. The pipeline runs automatically — conversion, enrichment, indexing, mapping — so the knowledge base stays current without manual effort.
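In rough pseudocode, that automatic flow might look like the sketch below. The stage names mirror the prose, but every function and field name here is my own invention, not the system's actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    path: str
    markdown: str = ""
    metadata: dict = field(default_factory=dict)
    indexed: bool = False

def convert(paper: Paper) -> Paper:
    # PDF/DOCX -> Markdown (real conversion elided)
    paper.markdown = f"# Converted from {paper.path}"
    return paper

def enrich(paper: Paper) -> Paper:
    # Attach PubMed metadata (placeholder value)
    paper.metadata["source"] = "pubmed"
    return paper

def index(paper: Paper) -> Paper:
    # Embed and register the paper for semantic search
    paper.indexed = True
    return paper

def run_pipeline(paper: Paper) -> Paper:
    # A new file in the drop folder flows through every stage in order
    for stage in (convert, enrich, index):
        paper = stage(paper)
    return paper

paper = run_pipeline(Paper(path="inbox/fgf21_review.pdf"))
```

The point of the shape, not the stubs: each stage is a plain function over one record, so adding a stage (or re-running one) never touches the others.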
Interactive Research Session
- Semantic search across full-text vault
- PubMed queries across the entire literature
- Cross-paper synthesis with specific citations
- Gap analysis — what's missing from the collection?
- Fact-checking claims against primary sources
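These capabilities compose into a single question-answering loop. A minimal sketch of the fan-out, with stubbed search functions standing in for the real tools (none of these names come from the actual system):

```python
def search_vault(question: str) -> list[dict]:
    # Semantic search over the local full-text vault (stubbed)
    return [{"citation": "Laeger et al. (2014, JCI)", "passage": "..."}]

def search_pubmed(question: str) -> list[dict]:
    # Query the broader literature for recent work (stubbed)
    return [{"citation": "Recent PubMed hit", "passage": "..."}]

def answer(question: str) -> dict:
    local = search_vault(question)
    external = search_pubmed(question)
    # Gap analysis: papers found on PubMed but absent from the vault
    # become recommendations to add to the collection.
    vault_citations = {hit["citation"] for hit in local}
    gaps = [hit for hit in external if hit["citation"] not in vault_citations]
    return {"evidence": local + external, "suggested_additions": gaps}

result = answer("FGF21: hindbrain vs. hypothalamus?")
```

The key design property is that both sources return passage-level records, so the synthesis step never has to cite anything vaguer than a specific passage in a specific paper.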
The pipeline, end to end:

- Input: PDF/DOCX (~1,000 papers)
- Convert & Enrich: PDF → Markdown, PubMed enrichment, deduplication
- Index & Map: semantic search, 35+ topic clusters, LLM key findings, cross-references, synthesis engine
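One plausible rule for the deduplication step is to treat the DOI as the identity key, falling back to a normalized title when no DOI exists. This is my guess at the logic, not the system's actual schema:

```python
def dedup_key(meta: dict) -> str:
    # DOI is the strongest identity; normalize case and whitespace
    doi = (meta.get("doi") or "").lower().strip()
    if doi:
        return doi
    # Fallback: collapse the title to a whitespace-normalized key
    return " ".join(meta.get("title", "").lower().split())

def deduplicate(papers: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for meta in papers:
        key = dedup_key(meta)
        if key not in seen:
            seen.add(key)
            unique.append(meta)
    return unique

papers = [
    {"doi": "10.1172/JCI74915", "title": "FGF21 and protein restriction"},
    {"doi": "10.1172/jci74915", "title": "FGF21 and protein restriction "},
    {"title": "A paper with no DOI"},
]
unique = deduplicate(papers)  # the first two collapse to one entry
```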
How Papers Are Understood
Each paper gets more than keyword indexing. An LLM reads the full text and extracts specific key findings — stated as precise experimental results, not the vague summaries you get from abstracts — along with factual tags and relationships to other papers in the collection. The goal is that when I ask a question, the system can point to the actual result in the actual paper, not a paraphrase.
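The per-paper record the extraction step produces might look something like this. The field names and the example content are illustrative, the post doesn't specify the schema, but the shape captures the requirement: every finding carries a verbatim passage so citations can point at the actual result:

```python
from dataclasses import dataclass

@dataclass
class KeyFinding:
    statement: str  # precise experimental result, not an abstract-style summary
    passage: str    # verbatim source text, enabling passage-level citation
    section: str    # where in the paper the result appears

@dataclass
class PaperRecord:
    citation: str
    findings: list[KeyFinding]
    tags: list[str]      # factual tags, reusing existing vocabulary
    related: list[str]   # citations of related papers in the collection

record = PaperRecord(
    citation="Laeger et al. (2014, JCI)",
    findings=[KeyFinding(
        statement="Circulating FGF21 rises under dietary protein restriction",
        passage="...",
        section="Results")],
    tags=["FGF21", "protein-restriction"],
    related=[],
)
```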
Tags aren't imposed from a fixed taxonomy. They emerge from the papers themselves and reuse existing vocabulary when appropriate. Over 35 topic clusters evolve automatically — a nightly review process proposes splits, merges, and new clusters, with human approval before changes take effect. The knowledge map grows with the collection.
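The nightly review can be thought of as a propose-then-gate loop: the model suggests changes, and nothing is applied without explicit approval. A toy sketch, with made-up thresholds standing in for whatever heuristics the real review uses:

```python
def propose(clusters: dict[str, int]) -> list[dict]:
    # clusters maps cluster name -> paper count; thresholds are illustrative
    out = []
    for name, count in clusters.items():
        if count > 50:
            out.append({"op": "split", "cluster": name})
        elif count < 3:
            out.append({"op": "merge", "cluster": name})
    return out

def approved_only(proposals: list[dict], approvals: set[int]) -> list[dict]:
    # Human gate: only explicitly approved proposals take effect
    return [p for i, p in enumerate(proposals) if i in approvals]

props = propose({"FGF21 signaling": 72, "misc": 2, "hindbrain circuits": 18})
todo = approved_only(props, approvals={0})  # approve only the split
```

Keeping the approval set explicit (rather than approve-by-default) is what lets the taxonomy evolve without ever drifting overnight.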
Where New Papers Come From
The vault is fed from two sources: my EndNote library (a career's worth of papers, bulk-imported) and an autonomous agent that surfaces new papers weekly via the Research Digest. When something on the digest catches my eye, I can select it for full-text retrieval — and once it's in the vault, it's automatically converted, enriched, indexed, and available in the next research session.
The goal is to close this loop completely: the agent finds a paper, I approve it, and full text flows straight into the vault. Today there's still a manual export step, and full-text access is unreliable for paywalled journals (publishers really don't want you automating this). But the architecture is designed so that when reliable full-text access exists, papers flow from discovery to indexed knowledge without intervention.
The Inbox Pipeline
A separate but related system for the ideas that hit at inconvenient times. I capture a quick thought on my phone — a project idea, a connection between papers, something I overheard at a seminar — and it lands in a staging area. Every other day, an LLM picks up each capture, researches it (competitive landscape, technical feasibility, connections to things I'm already working on), and writes a review document.
I read the review, mark which items to act on, and the pipeline executes. Same security compartmentalization as the agent architecture: the stage that reads the internet can't modify my files, and the stage that writes to the knowledge base has no internet access. Paranoid? Maybe. But it means I can let the system research freely without worrying about what it might overwrite.
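That split can be sketched as a simple capability model: each stage carries exactly the permissions it needs and nothing else. This is my reading of the design, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capabilities:
    network: bool
    write_files: bool

# The stage that reads the internet cannot write; the writer has no network
RESEARCH_STAGE = Capabilities(network=True, write_files=False)
WRITE_STAGE = Capabilities(network=False, write_files=True)

def fetch(caps: Capabilities, url: str) -> str:
    if not caps.network:
        raise PermissionError("this stage has no internet access")
    return "<response>"  # placeholder for the real fetch

def write_review(caps: Capabilities, path: str, text: str) -> None:
    if not caps.write_files:
        raise PermissionError("this stage cannot modify local files")
    # actual file write elided
```

With this shape, a prompt-injected research stage can at worst return bad text, which still has to pass through me before the write stage ever sees it.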