Transcribe audio files with Whisper, summarize into a 3-level hierarchical topic outline with an LLM, and store everything in a local SQLite database. Includes a semantic visualization layer (UMAP 3-D scatter, agglomerative clustering, cosine-similarity search) and a RAG scaffold (ingest → embed → retrieve → generate) exposed via CLI and HTTP API.
📖 Hosted docs: https://autologger.github.io/AutoRAG/ — built from docs/
with Sphinx and published to GitHub Pages on every push to main
(.github/workflows/docs.yml).
To build the same site locally:
uv sync --group docs
uv run make -C docs html
# open docs/_build/html/index.html
The site includes a user guide (installation, quickstart, per-feature
walkthroughs), an internals reference (architecture, extras model,
audio pipeline design, Ollama tuning, frontend, packaging), and an
autodoc-generated API reference for every module under
src/autorag/.
# Install full stack (audio + diarization + RAG + server + YouTube download)
uv sync --all-extras
# Transcribe a local audio file (saves word spans to SQLite)
autorag transcribe session.webm
# Run the full pipeline: Whisper + LLM topics + Chroma embeddings
autorag generate-topics session.webm
# …or a YouTube URL — yt-dlp downloads the audio to a temp .webm first
autorag generate-topics https://www.youtube.com/watch?v=dQw4w9WgXcQ
generate-topics prints the persisted topic JSON to stdout; timing info goes to stderr. The database is written to ~/.autorag/autorag.db by default.
AutoRAG is also a pip-installable SDK. Install from a tagged release on GitHub:
# Audio → topics agent only (Whisper + diarization)
pip install "autorag[audio,diarize] @ git+https://github.com/AutoLogger/AutoRAG@v0.4.0"
# Add YouTube URL support (yt-dlp)
pip install "autorag[audio,diarize,youtube] @ git+https://github.com/AutoLogger/AutoRAG@v0.4.0"
# Full stack (also installs Chroma + UMAP + FastAPI)
pip install "autorag[all] @ git+https://github.com/AutoLogger/AutoRAG@v0.4.0"
from autorag import AutoRAG
rag = AutoRAG()
# Step 1: Whisper + diarization → word spans
words = rag.transcribe("meeting.wav")
# Or a YouTube URL (requires [youtube] extra):
words = rag.transcribe("https://youtu.be/dQw4w9WgXcQ")
# Step 2: LLM topic extraction (requires [audio,diarize] for LangChain/Ollama)
topics = rag.generate_topics(words)
print(topics["topics"]) # hierarchical topic tree (L0/L1/L2)
# Step 3a: persist word spans to SQLite (requires [rag] extra)
rag.persist_transcription("meeting.wav", words, title="Weekly sync")
# Step 3b: persist topic tree + embed titles into Chroma (requires [rag] extra)
rag.persist_topics("meeting.wav", topics, words=words, title="Weekly sync")

| Extra | Adds | Use when you want… |
|---|---|---|
| audio | whisperx, torch, imageio-ffmpeg | …to call rag.transcribe() / rag.build_agent() |
| diarize | pyannote.audio, huggingface-hub | …speaker labels (combine with audio) |
| youtube | yt-dlp | …to pass a YouTube URL to rag.transcribe() / autorag transcribe |
| rag | chromadb, umap-learn, scikit-learn, pydantic_sqlite, numpy | …rag.persist_transcription(), viz, or document RAG |
| server | fastapi, uvicorn[standard] | …autorag serve / the HTTP API |
| broker | pika | …the async RabbitMQ job pipeline (autorag.services, the /jobs/* endpoints + workers). Sync SDK/CLI/API never need it |
| all | everything above | …the full local-dev stack |
[diarize] is meant to ride on top of [audio] — pyannote needs the same torch + ffmpeg stack. Install both together: pip install 'autorag[audio,diarize]'.
/viz is served from a Vite-built React + TypeScript bundle. Source lives in
frontend/; built output lives in src/autorag/static/viz/ and is committed
to git so the Python wheel and CI don't need a node toolchain.
Rebuild after editing anything under frontend/src/:
cd frontend && npm install && npm run build
The build writes index.html + hashed assets/* into src/autorag/static/viz/
(via outDir in frontend/vite.config.ts). src/autorag/viz.py serves the
HTML at /viz; src/autorag/api.py mounts the assets dir at /viz-assets.
For interactive development, cd frontend && npm run dev runs Vite on
http://localhost:5173 with /viz/data and /viz/search proxied to a separately
running autorag serve on port 8000.
- Bump __version__ in src/autorag/__init__.py and version in pyproject.toml.
- uv lock to refresh, commit.
- git tag v0.x.0 && git push --tags.
Consumers then pin to the tag: pip install "autorag[...] @ git+https://github.com/AutoLogger/AutoRAG@v0.x.0".
autorag transcribe SOURCE [OPTIONS]
SOURCE Audio file path or YouTube URL
(youtube.com / youtu.be / m.youtube.com / music.youtube.com)
--title -t TEXT Clip title (defaults to YouTube video title for URLs, else filename stem / video id)
--whisper-model -w TEXT Whisper model: tiny/base/small/medium/large [default: base]
--language -l TEXT Whisper language code (auto-detect if empty)
--persist/--no-persist Write word spans to SQLite (default: true)
--db PATH Override database path
Runs Whisper + diarization and outputs word spans as JSON on stdout. With --persist (default), the word spans are written to SQLite. Session IDs are stable: local paths map to UUID5 of the resolved path; YouTube URLs collapse to a canonical https://www.youtube.com/watch?v=<id> form. For LLM topic extraction, use autorag generate-topics.
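The stable-ID rule can be illustrated with a minimal sketch. It is not the project's implementation: the UUID namespace, the helper names, and the choice to hash the canonical URL for YouTube sources are all assumptions made for illustration.

```python
import uuid
from pathlib import Path
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: the namespace is illustrative; the project may use another one.
SKETCH_NAMESPACE = uuid.NAMESPACE_URL

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "music.youtube.com"}


def canonical_youtube_url(url: str) -> Optional[str]:
    """Return https://www.youtube.com/watch?v=<id>, or None for non-YouTube input."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in {"youtu.be", "www.youtu.be"}:
        # Short links carry the video id as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        # Watch links carry the video id in the `v` query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return f"https://www.youtube.com/watch?v={video_id}" if video_id else None


def session_id(source: str) -> str:
    """UUID5 of the canonical URL (assumed) or of the resolved local path."""
    canonical = canonical_youtube_url(source)
    key = canonical if canonical is not None else str(Path(source).resolve())
    return str(uuid.uuid5(SKETCH_NAMESPACE, key))
```

Under these assumptions, https://youtu.be/<id> and https://www.youtube.com/watch?v=<id> yield the same ID, as do meeting.wav and ./meeting.wav when run from the same directory.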
autorag generate-topics SOURCE [OPTIONS]
SOURCE Audio file path or YouTube URL
--title -t TEXT Clip title
--whisper-model -w TEXT Whisper model [default: base]
--provider -p TEXT LLM provider (ollama) [default: ollama]
--llm-model -m TEXT LLM model name [default: gemma4:latest]
--language -l TEXT Whisper language code (auto-detect if empty)
--transcription -T TEXT Pre-computed word spans as a JSON string (skip Whisper)
--persist/--no-persist Write transcription + topics to SQLite/Chroma (default: true)
--db PATH Override database path
Full pipeline: transcribes (or reads from SQLite cache / --transcription), runs the five-stage LLM topic extraction, and persists everything (word spans + topic tree + Chroma embeddings). Outputs the persisted topic JSON to stdout; timing breakdown goes to stderr:
=== Topic Generation Timing Breakdown ===
whisper 12.341s
agent 21.842s
cli_store_words 0.003s
cli_finalize 0.005s
cli_embed 0.231s
────────────────────────────────────
TOTAL 34.422s
device: cuda
The agent stage covers all five LLM passes (L1 boundaries, subdivide decisions, L2 boundaries, per-node summarization, L0 aggregation). YouTube URLs are downloaded to a temp .webm (via yt-dlp, requires [youtube]); title, created_at, and file_path are populated from yt-dlp's info dict.
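A per-stage breakdown like the one above can be collected with a small timer. This is a sketch only: StageTimer and its methods are hypothetical names, and the column widths are approximate. The report goes to stderr so that stdout stays reserved for the topic JSON.

```python
import sys
import time
from contextlib import contextmanager


class StageTimer:
    """Hypothetical helper: accumulates wall-clock seconds per named stage."""

    def __init__(self):
        self.stages = {}  # insertion-ordered: stage name -> seconds

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def format_report(self):
        lines = ["=== Topic Generation Timing Breakdown ==="]
        for name, seconds in self.stages.items():
            lines.append(f"{name:<20}{seconds:>8.3f}s")
        lines.append(f"{'TOTAL':<20}{sum(self.stages.values()):>8.3f}s")
        return "\n".join(lines)

    def report(self):
        # Timing goes to stderr; stdout is left for machine-readable output.
        print(self.format_report(), file=sys.stderr)
```

A caller would wrap each step, for example `with timer.stage("whisper"): ...`, and call `timer.report()` once at the end.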
autorag blocks SOURCE [OPTIONS]
SOURCE Audio file path or YouTube URL
--seconds -n INT Time-block window length in seconds [default: 10]
--force-retranscribe Re-run transcription even if cached
--title -t TEXT Clip title (only used on cache miss)
--whisper-model -w TEXT Whisper model [default: base]
--provider -p TEXT LLM provider [default: ollama]
--llm-model -m TEXT LLM model name [default: gemma4:latest]
--language -l TEXT Whisper language code (auto-detect if empty)
--db PATH Override database path
Prints the cached transcript as N-second time blocks, one MM:SS-MM:SS Speaker K: ... line per speaker turn within each block. Reads from SQLite when the source has been transcribed before (no [audio] extra needed for the cache hit); otherwise runs the full transcribe + persist pipeline first, then formats. Same SDK call: AutoRAG.transcribe_blocks(source, seconds=10). The pure formatter is exposed as from autorag import format_blocks for callers who already have a WordSpan list in hand.
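To make the block format concrete, here is a minimal sketch of such a formatter. It is not the real format_blocks: the function name is hypothetical, the input is assumed to be sorted by start time, and the printed range is assumed to be the block window rather than the turn's own span.

```python
from itertools import groupby


def mmss(seconds):
    """Format a second offset as MM:SS."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def sketch_format_blocks(words, seconds=10):
    """Group word spans into fixed windows, then into speaker turns."""
    lines = []
    # Outer grouping: which N-second window each word starts in.
    for block, block_words in groupby(words, key=lambda w: int(w["s"] // seconds)):
        start, end = block * seconds, (block + 1) * seconds
        # Inner grouping: consecutive words by the same speaker form one turn.
        for speaker, turn in groupby(block_words, key=lambda w: w["speaker"]):
            text = "".join(w["w"] for w in turn).strip()
            lines.append(f"{mmss(start)}-{mmss(end)} Speaker {speaker}: {text}")
    return lines


example_words = [
    {"w": " Hello", "s": 0.0, "e": 0.4, "speaker": "0"},
    {"w": " world", "s": 0.4, "e": 0.8, "speaker": "0"},
    {"w": " Hi", "s": 1.0, "e": 1.2, "speaker": "1"},
    {"w": " again", "s": 12.0, "e": 12.5, "speaker": "1"},
]
for line in sketch_format_blocks(example_words):
    print(line)
```

Joining the tokens without a separator relies on the leading space that each word token may carry.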
autorag ingest PATH [PATH ...]
Ingest one or more files or directories into the vector store.
autorag query QUESTION [--top-k K]
Ask a question against the ingested corpus and print the generated answer.
autorag serve [--host HOST] [--port PORT] [--reload]
Run the HTTP API server (default: http://127.0.0.1:8000).
Optional async pipeline (needs autorag[broker,rag] + a running RabbitMQ). Decoupled from the synchronous commands above — installing or running it changes nothing about transcribe / generate-topics / serve.
autorag jobs submit SOURCE [OPTIONS]
SOURCE Audio file path or YouTube URL
--title -t TEXT Clip title
--whisper-model -w TEXT Whisper model [default: base]
--llm-model -m TEXT LLM model name [default: gemma4:latest]
--language -l TEXT Whisper language code [default: en]
autorag jobs status JOB_ID
submit enqueues the job on the broker and prints {"job_id": ..., "session_id": ...}; status prints the job's status + per-stage state as JSON. A finished async job writes the same SQLite/Chroma rows a CLI run would, so /viz and every other reader work unchanged. Without [broker]/[rag] the commands exit with an install hint. See the "Async pipeline" section of CLAUDE.md for the architecture.
To deploy the broker + workers, use the repo-root docker-compose.yml (docker compose up -d — RabbitMQ + the GPU/IO workers). Note it does not start the API server (run autorag serve yourself with the same AUTORAG_DB_PATH/AUTORAG_BROKER_URL, or just use the CLI above) and Ollama stays external (AUTORAG_OLLAMA_BASE_URL). Full topology + caveats: the "Async job pipeline" section of the server guide.
Start the server with autorag serve, then:
| Method | Path | Description |
|---|---|---|
| GET | /health | Returns {"status": "ok"} |
| POST | /ingest | Ingest files; body: {"paths": [...]} |
| POST | /query | Ask a question; body: {"question": "...", "top_k": 5} |
| GET | /viz | Interactive 3-D topic scatter (HTML) |
| GET | /viz/data | UMAP 3-D coordinates + cluster labels + similarity edges (JSON) |
| GET | /viz/search | Semantic search over topics; params: q=<query>, top_k=10 (JSON) |
| POST | /jobs/audio | Enqueue an async audio→topics job → 202 + job_id (needs [broker,rag]) |
| GET | /jobs/{id} | Job status + per-stage state |
| GET | /jobs/{id}/result | Finished clip row; 409 until the job is done |
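As a rough illustration of calling these endpoints from Python with only the standard library, the sketch below builds the requests for /query and /viz/search. The helper names are hypothetical, and the response shapes are not assumed here; only the paths, bodies, and parameters listed in the table are used.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000"  # default address of `autorag serve`


def build_query_request(question, top_k=5, base_url=BASE_URL):
    """POST /query with a JSON body."""
    body = json.dumps({"question": question, "top_k": top_k}).encode("utf-8")
    return Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_search_request(q, top_k=10, base_url=BASE_URL):
    """GET /viz/search with URL-encoded query parameters."""
    return Request(f"{base_url}/viz/search?{urlencode({'q': q, 'top_k': top_k})}", method="GET")
```

With a server running, either request can be sent with `urllib.request.urlopen(request)` and the body decoded with `json.load`.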
Ollama is the only supported provider. It runs locally — no API key required.
| Provider | Env var | Default model | Notes |
|---|---|---|---|
| ollama | (none — local) | gemma4:latest | (built-in) |
Ollama is invoked via LangChain (langchain-ollama). The provider constructs messages with SystemMessage/HumanMessage and calls ChatOllama.with_structured_output(schema, method="json_schema") to enforce the topic-tree JSON schema. Embeddings are generated with OllamaEmbeddings.embed_documents().
| Variable | Default | Description |
|---|---|---|
| AUTORAG_OLLAMA_BASE_URL | http://localhost:11434 | Ollama server URL (used by both the agent and the embedder) |
| AUTORAG_DB_PATH | ~/.autorag/autorag.db | SQLite database path |
| AUTORAG_CHUNK_SIZE | 1000 | Characters per chunk when ingesting |
| AUTORAG_CHUNK_OVERLAP | 200 | Overlap between consecutive chunks |
| AUTORAG_WHISPER_DEVICE | auto | auto, cpu, or cuda (Whisper + pyannote) |
| AUTORAG_EMBED_MODEL | nomic-embed-text | Ollama model for topic title embeddings |
| HF_TOKEN | (unset) | HuggingFace token for pyannote/speaker-diarization-3.1. Without it, every word is labeled speaker "0". |
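The sketch below shows one way these variables and their defaults could be read, and what a chunk size of 1000 with an overlap of 200 means in practice. SketchSettings, load_settings, and chunk_text are hypothetical names, and the sliding-window chunker is an assumption about how chunking behaves, not the project's code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSettings:
    ollama_base_url: str
    db_path: str
    chunk_size: int
    chunk_overlap: int
    whisper_device: str
    embed_model: str


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return SketchSettings(
        ollama_base_url=env.get("AUTORAG_OLLAMA_BASE_URL", "http://localhost:11434"),
        db_path=os.path.expanduser(env.get("AUTORAG_DB_PATH", "~/.autorag/autorag.db")),
        chunk_size=int(env.get("AUTORAG_CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("AUTORAG_CHUNK_OVERLAP", "200")),
        whisper_device=env.get("AUTORAG_WHISPER_DEVICE", "auto"),
        embed_model=env.get("AUTORAG_EMBED_MODEL", "nomic-embed-text"),
    )


def chunk_text(text, size, overlap):
    """Assumed behaviour: fixed-size character windows that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, each window starts 800 characters after the previous one, so a 2600-character document produces three 1000-character chunks.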
Whisper and PyTorch ship with the [audio] extra; pyannote with [diarize]. See Install as a library for the extras matrix.
Single SQLite database at ~/.autorag/autorag.db (override with AUTORAG_DB_PATH).
CREATE TABLE audio_clips (
id TEXT PRIMARY KEY, -- UUID5 (stable per resolved file path)
title TEXT NOT NULL, -- user-supplied or filename stem
file_path TEXT NOT NULL,
created_at TEXT NOT NULL, -- ISO 8601 UTC (file mtime)
transcription TEXT, -- JSON: word-level transcript (see below)
topics TEXT, -- JSON: topic list (see below)
whisper_model TEXT, -- e.g. "base"
provider TEXT, -- e.g. "ollama"
llm_model TEXT -- e.g. "gemma4:latest"
);
Topic embeddings live alongside the SQLite db in a persistent Chroma collection (<db_dir>/chroma/, collection audio_clip_topics, cosine distance), keyed by <clip_id>:<topic_index>.
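For readers who want to inspect the database directly, here is a minimal sketch using Python's sqlite3 module. It recreates the audio_clips schema in an in-memory database and reads the topics JSON back; the sample row and the helper name are illustrative only.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS audio_clips (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    file_path TEXT NOT NULL,
    created_at TEXT NOT NULL,
    transcription TEXT,
    topics TEXT,
    whisper_model TEXT,
    provider TEXT,
    llm_model TEXT
);
"""


def list_clip_topics(conn):
    """Return (clip id, title, decoded topic list) for clips that have topics."""
    rows = conn.execute(
        "SELECT id, title, topics FROM audio_clips "
        "WHERE topics IS NOT NULL ORDER BY created_at"
    ).fetchall()
    return [(clip_id, title, json.loads(topics)) for clip_id, title, topics in rows]


# In-memory database with one made-up row, so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(
    "INSERT INTO audio_clips VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (
        "clip-1", "Weekly sync", "/tmp/meeting.wav", "2024-01-01T00:00:00+00:00",
        json.dumps([{"w": " Hello", "s": 0.0, "e": 0.4, "speaker": "0"}]),
        json.dumps([{"title": "Introduction", "level": 1, "start_s": 0.0}]),
        "base", "ollama", "gemma4:latest",
    ),
)
print(list_clip_topics(conn))
```

Pointing `sqlite3.connect` at the real database path instead of `:memory:` would let the same query run against stored clips.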
Word-level timestamps from Whisper, flattened to absolute offsets from audio start:
[
{"w": " Hello", "s": 0.0, "e": 0.4, "speaker": "0"},
{"w": " world", "s": 0.4, "e": 0.8, "speaker": "1"}
]

| Field | Description |
|---|---|
| w | Word token (may include leading space) |
| s | Start time (seconds from audio start) |
| e | End time (seconds from audio start) |
| speaker | Speaker label "0", "1", … normalized in first-appearance order. "0" when diarization is disabled. |
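First-appearance normalization of speaker labels can be sketched as follows. The function name is hypothetical, and the raw labels in the example (SPEAKER_02 and so on) are only an assumption about what a diarizer might emit.

```python
def normalize_speakers(words):
    """Relabel speakers as "0", "1", ... in the order they first appear."""
    mapping = {}
    normalized = []
    for word in words:
        # Words without a speaker share one key, so an undiarized
        # transcript ends up labeled "0" throughout.
        raw = word.get("speaker", "")
        label = mapping.setdefault(raw, str(len(mapping)))
        normalized.append({**word, "speaker": label})
    return normalized


raw_words = [
    {"w": " Hello", "s": 0.0, "e": 0.4, "speaker": "SPEAKER_02"},
    {"w": " world", "s": 0.4, "e": 0.8, "speaker": "SPEAKER_00"},
    {"w": " again", "s": 0.8, "e": 1.1, "speaker": "SPEAKER_02"},
]
print([w["speaker"] for w in normalize_speakers(raw_words)])
```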
Hierarchical topics produced by the LLM, flattened to a list sorted by start_s:
[
{"title": "Introduction", "level": 1, "start_s": 0.0, "duration_s": 42.1, "number": "1", "summary": "Speaker introduces the session goals."},
{"title": "Setup", "level": 2, "start_s": 5.2, "duration_s": 15.0, "number": "1.1", "summary": "Environment prerequisites are reviewed."},
{"title": "Config", "level": 2, "start_s": 20.4, "duration_s": 21.7, "number": "1.2", "summary": "Key config values and their effects."}
]

| Field | Description |
|---|---|
| title | LLM-generated topic label (≤120 chars) |
| summary | 2–4 sentence description of what was discussed |
| level | Depth: 1 = top-level, 2 = subtopic, 3 = sub-subtopic |
| start_s | Offset from audio start where this topic begins (seconds) |
| duration_s | Duration; the last sibling at each level extends to the transcript end |
| number | Hierarchical label, e.g. "1.2.3" |
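The number field can be derived from the level sequence of a list sorted by start_s. The sketch below is one way to do that, not necessarily how the project computes it; it assumes levels never skip a depth (a level 3 topic always follows a level 2 topic).

```python
def assign_numbers(levels):
    """Turn a depth sequence like [1, 2, 2] into labels like "1", "1.1", "1.2"."""
    counters = []
    numbers = []
    for level in levels:
        # Going deeper: open a fresh counter for each new depth.
        while len(counters) < level:
            counters.append(0)
        # Coming back up: drop counters that belong to deeper levels.
        del counters[level:]
        counters[level - 1] += 1
        numbers.append(".".join(str(c) for c in counters))
    return numbers


print(assign_numbers([1, 2, 2, 1, 2, 3]))
```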
Each topic's "<title>. <summary>" (or just title when there is no summary) is embedded with the Ollama embedding model (default: nomic-embed-text) and upserted into the audio_clip_topics Chroma collection. Each record carries the embedding plus metadata (clip_id, clip_title, topic_index, title, summary, level, start_s, duration_s, number); ids are <clip_id>:<topic_index> and topic_index refers to the position within the clip's filtered (title-bearing) topic list. Used by /viz/data and /viz/search.
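The record layout described above can be sketched in plain Python. build_topic_records is a hypothetical helper; it only prepares ids, texts, and metadata and does not call Ollama or Chroma.

```python
METADATA_KEYS = ("title", "summary", "level", "start_s", "duration_s", "number")


def build_topic_records(clip_id, clip_title, topics):
    """Build ids, texts to embed, and metadata for a clip's title-bearing topics."""
    titled = [t for t in topics if t.get("title")]
    ids, documents, metadatas = [], [], []
    for index, topic in enumerate(titled):
        summary = topic.get("summary") or ""
        # "<title>. <summary>", or the bare title when there is no summary.
        documents.append(f"{topic['title']}. {summary}" if summary else topic["title"])
        # topic_index is the position in the filtered list, not the original one.
        ids.append(f"{clip_id}:{index}")
        metadata = {"clip_id": clip_id, "clip_title": clip_title, "topic_index": index}
        metadata.update({k: topic[k] for k in METADATA_KEYS if k in topic})
        metadatas.append(metadata)
    return ids, documents, metadatas


sample_topics = [
    {"title": "Introduction", "summary": "Goals are set.", "level": 1, "start_s": 0.0},
    {"title": "", "summary": "Dropped: no title.", "level": 2, "start_s": 3.0},
    {"title": "Setup", "level": 2, "start_s": 5.2},
]
ids, documents, metadatas = build_topic_records("clip-1", "Weekly sync", sample_topics)
print(ids, documents)
```

Because untitled topics are filtered out first, "Setup" receives index 1 here even though it is the third entry in the input.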
GET /viz serves an interactive 3-D scatter of all stored topics. The pipeline:
- Embed: topic titles + summaries are embedded via Ollama. Stored embeddings are reused; missing ones are computed on demand.
- Project: embeddings are projected to 3 dimensions via UMAP (metric=cosine, n_neighbors=15).
- Cluster: topics are grouped with agglomerative clustering (metric=cosine, linkage=average, distance_threshold=0.35). The threshold is tunable via the distance_threshold query param (0.0–1.0).
- Edges: for each topic, the top-5 cosine-similar neighbours above 0.60 similarity are wired as undirected edges in the scatter.
- Render: the browser renders the 3-D scatter with Three.js. Points are coloured by cluster; edges are drawn as lines. Hovering shows the topic title, clip, and summary.
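The edge-building step can be illustrated with a dependency-free sketch. It is written in plain Python for clarity and is not the project's topic_cluster.build_edges; the function names and the rounding are assumptions, while the top-5 and 0.60 values come from the description above.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sketch_build_edges(vectors, top_k=5, min_similarity=0.60):
    """Link each point to its top_k most similar neighbours above the threshold."""
    best = {}
    for i, vec in enumerate(vectors):
        scored = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )
        for similarity, j in scored[:top_k]:
            if similarity > min_similarity:
                # Undirected: store each pair once, smaller index first.
                best[(min(i, j), max(i, j))] = similarity
    return [
        {"a": a, "b": b, "similarity": round(similarity, 4)}
        for (a, b), similarity in sorted(best.items())
    ]


example_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(sketch_build_edges(example_vectors))
```

The pairwise loop is quadratic in the number of topics, which is fine for a sketch; a vectorized similarity matrix would be the natural choice at scale.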
{
"points": [
{
"topic_title": "Introduction",
"clip_id": "...",
"clip_title": "Session 1",
"level": 1,
"start_s": 0.0,
"duration_s": 42.1,
"number": "1",
"summary": "...",
"x": 0.12,
"y": -0.34,
"z": 0.09,
"cluster_id": 2
}
],
"edges": [{"a": 0, "b": 4, "similarity": 0.82}],
"clip_ids": ["..."],
"clip_titles": {"...": "Session 1"},
"total_topics": 47,
"total_clips": 3,
"total_clusters": 8
}
GET /viz/search?q=gradient+descent&top_k=5
[
{
"point_index": 12,
"topic_title": "Backpropagation deep-dive",
"clip_title": "ML Lecture 3",
"clip_id": "...",
"similarity": 0.91,
"summary": "..."
}
]
autorag transcribe FILE
│
├─ db.create_clip() Register file in SQLite
├─ agent.transcribe() Whisper + 5-stage LLM pipeline
│ ├─ whisper_runner.get_model() → .transcribe_segment()
│ ├─ Stage 2: L1 boundaries (1 LLM call)
│ ├─ Stage 3a: decide subdivide (N LLM calls, batched)
│ ├─ Stage 3b: L2 boundaries (M LLM calls, batched)
│ ├─ Stage 4: per-node summaries (K LLM calls, batched)
│ └─ Stage 5: L0 aggregate (1 LLM call)
├─ _collapse_lone_children() Drop single-child chains
├─ db.store_transcription() Persist word spans
├─ _topics_to_events() → db.add_analytics_event() × N
├─ db.finalize_topics() Compute durations, persist topics JSON
└─ Embedder().embed_texts() Ollama embed → ChromaStore.add_topic_embeddings()
autorag serve
└─ FastAPI
├─ /ingest POST → core.AutoRAG.ingest()
├─ /query POST → core.AutoRAG.query()
├─ /viz → static/viz/index.html (Vite-built React + TS bundle)
├─ /viz-assets/* → static/viz/assets/* (StaticFiles mount)
├─ /viz/data GET → viz.viz_data()
│ ├─ db.list_clips()
│ ├─ ChromaStore.get_clip_embeddings() (per clip)
│ ├─ Embedder().embed_texts() (missing vecs only)
│ ├─ viz.umap_3d()
│ ├─ topic_cluster.cluster_embeddings()
│ └─ topic_cluster.build_edges()
└─ /viz/search GET → viz.viz_search()
├─ Embedder().embed_texts([q])
└─ ChromaStore.query(query_vec, top_k)