PubMed Knowledge Graph

66.2 million nodes. 1.04 billion edges. Every article published in PubMed since 1966.

We loaded the entire PubMed/MEDLINE corpus -- 37 million articles, 30 million authors, 70 million citations -- then asked:

"What are the top MeSH terms co-occurring with cancer research?"

MATCH (a:Article)-[:ANNOTATED_WITH]->(m1:MeSHTerm {name: 'Neoplasms'})
MATCH (a)-[:ANNOTATED_WITH]->(m2:MeSHTerm)
WHERE m2.name <> 'Neoplasms'
RETURN m2.name, count(DISTINCT a) AS articles
ORDER BY articles DESC LIMIT 5

MeSH Term	Articles
Humans	513,845
Female	298,412
Male	276,891
Animals	189,234
Middle Aged	156,728

One billion edges. One query. One machine. Powered by Samyama Graph.

See all 100 benchmark queries →

Schema

6 node labels -- Article (37M), Author (30M), MeSHTerm (30K), Chemical (500K), Journal (30K), Grant (1M)

6 edge types -- AUTHORED_BY (150M), ANNOTATED_WITH (400M), MENTIONS_CHEMICAL (126M), PUBLISHED_IN (37M), CITES (70M), FUNDED_BY (5M)

Data source -- PubMed/MEDLINE baseline from NLM (1,219 XML files, 101 GB compressed)

Quick Start

Load from snapshot (recommended)

# Download snapshot from release
curl -LO https://github.com/samyama-ai/samyama-graph/releases/download/kg-snapshots-v5/pubmed.sgsnap

# Start Samyama and import
./target/release/samyama
curl -X POST http://localhost:8080/api/tenants \
  -H 'Content-Type: application/json' \
  -d '{"id":"pubmed","name":"PubMed KG"}'
curl -X POST http://localhost:8080/api/tenants/pubmed/snapshot/import \
  -F "file=@pubmed.sgsnap"

Build from source (requires AWS -- 101 GB download)

git clone https://github.com/samyama-ai/pubmed-kg.git && cd pubmed-kg
pip install -e .
python etl/download_pubmed.py --output-dir data/pubmed-raw    # 101 GB
python etl/parse_pubmed_xml.py data/pubmed-raw/ --output-dir data/pubmed
cargo run --release --example pubmed_loader -- --data-dir data/pubmed

Example Queries

-- Most-cited articles
MATCH (a:Article)<-[:CITES]-(citing:Article)
RETURN a.title, count(citing) AS citations ORDER BY citations DESC LIMIT 10

-- Cross-KG: PubMed articles linked to clinical trials
MATCH (a:Article)-[:REFERENCED_IN]->(t:ClinicalTrial)-[:TESTS]->(i:Intervention)
RETURN i.name, count(DISTINCT a) AS articles ORDER BY articles DESC LIMIT 10

Part of the Biomedical Trifecta

PubMed is the backbone of Samyama's billion-edge biomedical benchmark, federated with Clinical Trials (27M edges), Pathways (835K edges), and Drug Interactions (388K edges). Together: 74M nodes, 1 billion edges, 96/100 queries passing.

Links


Samyama Graph	github.com/samyama-ai/samyama-graph
The Book	samyama-ai.github.io/samyama-graph-book
Benchmark (100 queries)	Biomedical Benchmark
Contact	samyama.dev/contact

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
etl		etl
mcp_server		mcp_server
scripts		scripts
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PubMed Knowledge Graph

Schema

Quick Start

Load from snapshot (recommended)

Build from source (requires AWS -- 101 GB download)

Example Queries

Part of the Biomedical Trifecta

Links

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PubMed Knowledge Graph

Schema

Quick Start

Load from snapshot (recommended)

Build from source (requires AWS -- 101 GB download)

Example Queries

Part of the Biomedical Trifecta

Links

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages