AutoRAG

Transcribe audio files with Whisper, summarize them into a 3-level hierarchical topic outline with an LLM, and store everything in a local SQLite database. AutoRAG also includes a semantic visualization layer (UMAP 3-D scatter, agglomerative clustering, cosine-similarity search) and a RAG scaffold (ingest → embed → retrieve → generate) exposed via CLI and HTTP API.

Documentation

📖 Hosted docs: https://autologger.github.io/AutoRAG/ — built from docs/ with Sphinx and published to GitHub Pages on every push to main (.github/workflows/docs.yml).

To build the same site locally:

uv sync --group docs
uv run make -C docs html
# open docs/_build/html/index.html

The site includes a user guide (installation, quickstart, per-feature walkthroughs), an internals reference (architecture, extras model, audio pipeline design, Ollama tuning, frontend, packaging), and an autodoc-generated API reference for every module under src/autorag/.

Quickstart

# Install full stack (audio + diarization + RAG + server + YouTube download)
uv sync --all-extras

# Transcribe a local audio file (saves word spans to SQLite)
autorag transcribe session.webm

# Run the full pipeline: Whisper + LLM topics + Chroma embeddings
autorag generate-topics session.webm

# …or a YouTube URL — yt-dlp downloads the audio to a temp .webm first
autorag generate-topics https://www.youtube.com/watch?v=dQw4w9WgXcQ

generate-topics prints the persisted topic JSON to stdout; timing info goes to stderr. The database is written to ~/.autorag/autorag.db by default.
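
Because the topic JSON and the timing report go to different streams, a script can capture them separately. The following is a minimal sketch, assuming stdout carries nothing but the JSON document; run_json_command is a hypothetical helper, not part of AutoRAG, and the demo uses a stand-in child process so it runs without AutoRAG installed.

```python
import json
import subprocess
import sys


def run_json_command(argv):
    """Run a command, returning (parsed stdout JSON, raw stderr text)."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout), proc.stderr


if __name__ == "__main__":
    # Real use would pass e.g. ["autorag", "generate-topics", "session.webm"].
    # The stand-in below mimics "JSON on stdout, timing on stderr".
    demo = [
        sys.executable,
        "-c",
        "import sys; print('{\"topics\": []}'); print('timing', file=sys.stderr)",
    ]
    topics, timing = run_json_command(demo)
    print(topics, timing.strip())
```

Keeping the two streams apart means the JSON can be piped into another tool while the timing report stays visible in the terminal.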

Install as a library

AutoRAG is also a pip-installable SDK. Install from a tagged release on GitHub:

# Audio → topics agent only (Whisper + diarization)
pip install "autorag[audio,diarize] @ git+https://github.com/AutoLogger/AutoRAG@v0.4.0"

# Add YouTube URL support (yt-dlp)
pip install "autorag[audio,diarize,youtube] @ git+https://github.com/AutoLogger/AutoRAG@v0.4.0"

# Full stack (also installs Chroma + UMAP + FastAPI)
pip install "autorag[all] @ git+https://github.com/AutoLogger/AutoRAG@v0.4.0"

from autorag import AutoRAG

rag = AutoRAG()

# Step 1: Whisper + diarization → word spans
words = rag.transcribe("meeting.wav")
# Or a YouTube URL (requires [youtube] extra):
words = rag.transcribe("https://youtu.be/dQw4w9WgXcQ")

# Step 2: LLM topic extraction (requires [audio,diarize] for LangChain/Ollama)
topics = rag.generate_topics(words)
print(topics["topics"])           # hierarchical topic tree (L0/L1/L2)

# Step 3a: persist word spans to SQLite (requires [rag] extra)
rag.persist_transcription("meeting.wav", words, title="Weekly sync")

# Step 3b: persist topic tree + embed titles into Chroma (requires [rag] extra)
rag.persist_topics("meeting.wav", topics, words=words, title="Weekly sync")

Extras

Extra	Adds	Use when you want…
`audio`	whisperx, torch, imageio-ffmpeg	…to call `rag.transcribe()` / `rag.build_agent()`
`diarize`	pyannote.audio, huggingface-hub	…speaker labels (combine with `audio`)
`youtube`	yt-dlp	…to pass a YouTube URL to `rag.transcribe()` / `autorag transcribe`
`rag`	chromadb, umap-learn, scikit-learn, pydantic_sqlite, numpy	…`rag.persist_transcription()`, viz, or document RAG
`server`	fastapi, uvicorn[standard]	…`autorag serve` / the HTTP API
`broker`	pika	…the async RabbitMQ job pipeline (`autorag.services`, the `/jobs/*` endpoints + workers). Sync SDK/CLI/API never need it
`all`	everything above	…the full local-dev stack

[diarize] is meant to ride on top of [audio] — pyannote needs the same torch + ffmpeg stack. Install both together: pip install 'autorag[audio,diarize]'.

Frontend build (`/viz` page)

/viz is served from a Vite-built React + TypeScript bundle. Source lives in frontend/; built output lives in src/autorag/static/viz/ and is committed to git so the Python wheel and CI don't need a node toolchain.

Rebuild after editing anything under frontend/src/:

cd frontend && npm install && npm run build

The build writes index.html + hashed assets/* into src/autorag/static/viz/ (via outDir in frontend/vite.config.ts). src/autorag/viz.py serves the HTML at /viz; src/autorag/api.py mounts the assets dir at /viz-assets. For interactive development, cd frontend && npm run dev runs Vite on http://localhost:5173 with /viz/data and /viz/search proxied to a separately running autorag serve on port 8000.

Releasing a new version

1. Bump __version__ in src/autorag/__init__.py and version in pyproject.toml.
2. Run uv lock to refresh the lockfile, then commit.
3. Tag and push: git tag v0.x.0 && git push --tags.

Consumers then pin to the tag: pip install "autorag[...] @ git+https://github.com/AutoLogger/AutoRAG@v0.x.0".

CLI

`autorag transcribe`

autorag transcribe SOURCE [OPTIONS]

  SOURCE                        Audio file path or YouTube URL
                                (youtube.com / youtu.be / m.youtube.com / music.youtube.com)
  --title            -t  TEXT   Clip title (defaults to YouTube video title for URLs, else filename stem / video id)
  --whisper-model    -w  TEXT   Whisper model: tiny/base/small/medium/large  [default: base]
  --language         -l  TEXT   Whisper language code (auto-detect if empty)
  --persist/--no-persist        Write word spans to SQLite (default: true)
  --db                   PATH   Override database path

Runs Whisper + diarization and outputs word spans as JSON on stdout. With --persist (default), the word spans are written to SQLite. Session IDs are stable: local paths map to UUID5 of the resolved path; YouTube URLs collapse to a canonical https://www.youtube.com/watch?v=<id> form. For LLM topic extraction, use autorag generate-topics.
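
The stable-ID rule can be illustrated with the standard library. This is a sketch under stated assumptions, not AutoRAG's implementation: the UUID namespace (uuid.NAMESPACE_URL) and the choice to hash the canonical URL for YouTube sources are guesses, and canonical_youtube_url and session_id are hypothetical names.

```python
import uuid
from pathlib import Path
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "music.youtube.com"}


def canonical_youtube_url(url):
    """Return https://www.youtube.com/watch?v=<id>, or None if not a YouTube URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in {"youtu.be", "www.youtu.be"}:
        # Short links carry the video id as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return f"https://www.youtube.com/watch?v={video_id}" if video_id else None


def session_id(source):
    """Derive a stable ID: canonical URL for YouTube, resolved path otherwise."""
    key = canonical_youtube_url(source) or str(Path(source).resolve())
    # Assumption: NAMESPACE_URL is illustrative; the project's namespace is not documented here.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))
```

With this scheme, youtu.be and music.youtube.com links to the same video yield one ID, and so do a.wav and ./a.wav when resolved from the same working directory.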

`autorag generate-topics`

autorag generate-topics SOURCE [OPTIONS]

  SOURCE                        Audio file path or YouTube URL
  --title            -t  TEXT   Clip title
  --whisper-model    -w  TEXT   Whisper model  [default: base]
  --provider         -p  TEXT   LLM provider (ollama)  [default: ollama]
  --llm-model        -m  TEXT   LLM model name  [default: gemma4:latest]
  --language         -l  TEXT   Whisper language code (auto-detect if empty)
  --transcription    -T  TEXT   Pre-computed word spans as a JSON string (skip Whisper)
  --persist/--no-persist        Write transcription + topics to SQLite/Chroma (default: true)
  --db                   PATH   Override database path

Full pipeline: transcribes (or reads from SQLite cache / --transcription), runs the five-stage LLM topic extraction, and persists everything (word spans + topic tree + Chroma embeddings). Outputs the persisted topic JSON to stdout; timing breakdown goes to stderr:

=== Topic Generation Timing Breakdown ===
  whisper           12.341s
  agent             21.842s
  cli_store_words    0.003s
  cli_finalize       0.005s
  cli_embed          0.231s
  ────────────────────────────────────
  TOTAL             34.422s
  device: cuda

The agent stage covers all five LLM passes (L1 boundaries, subdivide decisions, L2 boundaries, per-node summarization, L0 aggregation). YouTube URLs are downloaded to a temp .webm (via yt-dlp, requires [youtube]); title, created_at, and file_path are populated from yt-dlp's info dict.
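
For scripts that track performance over time, the stderr report can be parsed into a dictionary. A small sketch, assuming the line layout shown in the sample above stays as "name, whitespace, seconds with an s suffix"; parse_timing is a hypothetical helper.

```python
import re

# A stage line is a bare name, whitespace, then a decimal number ending in "s".
TIMING_LINE = re.compile(r"^\s*(\w+)\s+(\d+\.\d+)s\s*$")


def parse_timing(report):
    """Map each stage name in a timing report to its duration in seconds."""
    stages = {}
    for line in report.splitlines():
        match = TIMING_LINE.match(line)
        if match:
            stages[match.group(1)] = float(match.group(2))
    return stages


report = """\
=== Topic Generation Timing Breakdown ===
  whisper           12.341s
  agent             21.842s
  cli_store_words    0.003s
  cli_finalize       0.005s
  cli_embed          0.231s
  TOTAL             34.422s
  device: cuda
"""
timing = parse_timing(report)
total = timing.pop("TOTAL")
# The per-stage times should add up to the reported total (within rounding).
assert abs(sum(timing.values()) - total) < 0.001
```

The header and the device line are skipped because they do not match the stage pattern.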

`autorag blocks`

autorag blocks SOURCE [OPTIONS]

  SOURCE                        Audio file path or YouTube URL
  --seconds          -n  INT    Time-block window length in seconds  [default: 10]
  --force-retranscribe          Re-run transcription even if cached
  --title            -t  TEXT   Clip title (only used on cache miss)
  --whisper-model    -w  TEXT   Whisper model  [default: base]
  --provider         -p  TEXT   LLM provider  [default: ollama]
  --llm-model        -m  TEXT   LLM model name  [default: gemma4:latest]
  --language         -l  TEXT   Whisper language code (auto-detect if empty)
  --db                   PATH   Override database path

Prints the cached transcript as N-second time blocks, one MM:SS-MM:SS Speaker K: ... line per speaker turn within each block. Reads from SQLite when the source has been transcribed before (no [audio] extra needed for the cache hit); otherwise runs the full transcribe + persist pipeline first, then formats the result. The equivalent SDK call is AutoRAG.transcribe_blocks(source, seconds=10). The pure formatter is also importable (from autorag import format_blocks) for callers who already have a WordSpan list in hand.
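
To make the block layout concrete, here is a rough sketch of such a formatter over word spans. It is not the library's format_blocks: the function name sketch_blocks is hypothetical, and it assumes the MM:SS-MM:SS range is the block window and that a word belongs to the block containing its start time.

```python
from itertools import groupby


def mmss(seconds):
    """Format a second count as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def sketch_blocks(words, seconds=10):
    """Group word spans into fixed windows, one line per speaker turn."""
    lines = []
    ordered = sorted(words, key=lambda w: w["s"])
    # Bucket words by the window their start time falls into.
    for block, in_block in groupby(ordered, key=lambda w: int(w["s"] // seconds)):
        window = f"{mmss(block * seconds)}-{mmss((block + 1) * seconds)}"
        # Consecutive words from the same speaker form one turn.
        for speaker, turn in groupby(in_block, key=lambda w: w["speaker"]):
            text = "".join(w["w"] for w in turn).strip()
            lines.append(f"{window} Speaker {speaker}: {text}")
    return lines
```

Word tokens keep their leading space, so joining them with an empty string and stripping once per turn reproduces normal spacing.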

`autorag ingest`

autorag ingest PATH [PATH ...]

Ingest one or more files or directories into the vector store.

`autorag query`

autorag query QUESTION [--top-k K]

Ask a question against the ingested corpus and print the generated answer.

`autorag serve`

autorag serve [--host HOST] [--port PORT] [--reload]

Run the HTTP API server (default: http://127.0.0.1:8000).

`autorag jobs`

Optional async pipeline (needs autorag[broker,rag] + a running RabbitMQ). Decoupled from the synchronous commands above — installing or running it changes nothing about transcribe / generate-topics / serve.

autorag jobs submit SOURCE [OPTIONS]

  SOURCE                        Audio file path or YouTube URL
  --title            -t  TEXT   Clip title
  --whisper-model    -w  TEXT   Whisper model  [default: base]
  --llm-model        -m  TEXT   LLM model name  [default: gemma4:latest]
  --language         -l  TEXT   Whisper language code  [default: en]

autorag jobs status JOB_ID

submit enqueues the job on the broker and prints {"job_id": ..., "session_id": ...}; status prints the job's status + per-stage state as JSON. A finished async job writes the same SQLite/Chroma rows a CLI run would, so /viz and every other reader work unchanged. Without [broker]/[rag] the commands exit with an install hint. See the "Async pipeline" section of CLAUDE.md for the architecture.
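
A caller that needs the finished result typically polls the status until the job is done. The sketch below shows one way to do that; it assumes the status JSON has a top-level "status" key whose terminal success value is "done". wait_for_job and the fetch_status callback are hypothetical, and a real loop should also stop on whatever failure state the service reports.

```python
import time


def wait_for_job(fetch_status, job_id, interval=2.0, timeout=600.0):
    """Poll fetch_status(job_id) until the job reports "done" or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_status(job_id)
        # Assumption: the status payload exposes a "status" field.
        if state.get("status") == "done":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not done after {timeout}s")
        time.sleep(interval)
```

fetch_status can wrap either the CLI (autorag jobs status JOB_ID) or the HTTP endpoint, as long as it returns the decoded JSON as a dict.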

To deploy the broker + workers, use the repo-root docker-compose.yml (docker compose up -d — RabbitMQ + the GPU/IO workers). Note it does not start the API server (run autorag serve yourself with the same AUTORAG_DB_PATH/AUTORAG_BROKER_URL, or just use the CLI above) and Ollama stays external (AUTORAG_OLLAMA_BASE_URL). Full topology + caveats: the "Async job pipeline" section of the server guide.

HTTP API

Start the server with autorag serve, then:

Method	Path	Description
GET	`/health`	Returns `{"status": "ok"}`
POST	`/ingest`	Ingest files — body: `{"paths": [...]}`
POST	`/query`	Ask a question — body: `{"question": "...", "top_k": 5}`
GET	`/viz`	Interactive 3-D topic scatter (HTML)
GET	`/viz/data`	UMAP 3-D coordinates + cluster labels + similarity edges (JSON)
GET	`/viz/search`	Semantic search over topics — params: `q=<query>`, `top_k=10` (JSON)
POST	`/jobs/audio`	Enqueue an async audio→topics job → `202` + `job_id` (needs `[broker,rag]`)
GET	`/jobs/{id}`	Job status + per-stage state
GET	`/jobs/{id}/result`	Finished clip row; `409` until the job is `done`

Provider

Ollama is the only supported provider. It runs locally — no API key required.

Provider	Env var	Default model	Notes
ollama	(none — local)	gemma4:latest	(built-in)

Ollama is invoked via LangChain (langchain-ollama). The provider constructs messages with SystemMessage/HumanMessage and calls ChatOllama.with_structured_output(schema, method="json_schema") to enforce the topic-tree JSON schema. Embeddings are generated with OllamaEmbeddings.embed_documents().

Environment variables

Variable	Default	Description
`AUTORAG_OLLAMA_BASE_URL`	`http://localhost:11434`	Ollama server URL (used by both the agent and the embedder)
`AUTORAG_DB_PATH`	`~/.autorag/autorag.db`	SQLite database path
`AUTORAG_CHUNK_SIZE`	`1000`	Characters per chunk when ingesting
`AUTORAG_CHUNK_OVERLAP`	`200`	Overlap between consecutive chunks
`AUTORAG_WHISPER_DEVICE`	`auto`	`auto`, `cpu`, or `cuda` (Whisper + pyannote)
`AUTORAG_EMBED_MODEL`	`nomic-embed-text`	Ollama model for topic title embeddings
`HF_TOKEN`	(unset)	HuggingFace token for `pyannote/speaker-diarization-3.1`. Without it, every word is labeled speaker `"0"`.

Whisper and PyTorch ship with the [audio] extra; pyannote with [diarize]. See Install as a library for the extras matrix.
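
To show how AUTORAG_CHUNK_SIZE and AUTORAG_CHUNK_OVERLAP interact, here is a sketch of fixed-size character chunking with overlap. It assumes a plain sliding window; the actual ingester may split on other boundaries, and chunk_text is a hypothetical name.

```python
import os


def chunk_text(text, size=None, overlap=None):
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if size is None:
        size = int(os.environ.get("AUTORAG_CHUNK_SIZE", "1000"))
    if overlap is None:
        overlap = int(os.environ.get("AUTORAG_CHUNK_OVERLAP", "200"))
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each chunk starts this many characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the final chunk already reaches the end of the text
    return chunks
```

With the defaults, each chunk starts 800 characters after the previous one, so a 2,500-character document produces three chunks.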

Database schema

Single SQLite database at ~/.autorag/autorag.db (override with AUTORAG_DB_PATH).

CREATE TABLE audio_clips (
    id              TEXT PRIMARY KEY,   -- UUID5 (stable per resolved file path)
    title           TEXT NOT NULL,      -- user-supplied or filename stem
    file_path       TEXT NOT NULL,
    created_at      TEXT NOT NULL,      -- ISO 8601 UTC (file mtime)
    transcription   TEXT,               -- JSON: word-level transcript (see below)
    topics          TEXT,               -- JSON: topic list (see below)
    whisper_model   TEXT,               -- e.g. "base"
    provider        TEXT,               -- e.g. "ollama"
    llm_model       TEXT                -- e.g. "gemma4:latest"
);

Topic embeddings live alongside the SQLite db in a persistent Chroma collection (<db_dir>/chroma/, collection audio_clip_topics, cosine distance), keyed by <clip_id>:<topic_index>.
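
The audio_clips table can be inspected with Python's sqlite3 module. The sketch below recreates the schema in memory and round-trips one row; the id, path, and timestamp are placeholder values, not real data.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE audio_clips (
    id              TEXT PRIMARY KEY,
    title           TEXT NOT NULL,
    file_path       TEXT NOT NULL,
    created_at      TEXT NOT NULL,
    transcription   TEXT,
    topics          TEXT,
    whisper_model   TEXT,
    provider        TEXT,
    llm_model       TEXT
);
"""

conn = sqlite3.connect(":memory:")  # point at the real database path to read actual clips
conn.execute(SCHEMA)
words = [{"w": " Hello", "s": 0.0, "e": 0.4, "speaker": "0"}]
conn.execute(
    "INSERT INTO audio_clips (id, title, file_path, created_at, transcription) "
    "VALUES (?, ?, ?, ?, ?)",
    ("demo-id", "Weekly sync", "/tmp/meeting.wav", "2024-01-01T00:00:00+00:00", json.dumps(words)),
)
row = conn.execute(
    "SELECT title, transcription, topics FROM audio_clips WHERE id = ?", ("demo-id",)
).fetchone()
title, transcription, topics = row
# JSON columns are stored as TEXT and may be NULL before the matching stage has run.
decoded = json.loads(transcription) if transcription else []
print(title, len(decoded), topics)
```

Both JSON columns are nullable, so readers should handle a clip that has a transcription but no topics yet.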

`transcription` column

Word-level timestamps from Whisper, flattened to absolute offsets from audio start:

[
  {"w": " Hello", "s": 0.0, "e": 0.4, "speaker": "0"},
  {"w": " world", "s": 0.4, "e": 0.8, "speaker": "1"}
]

Field	Description
`w`	Word token (may include leading space)
`s`	Start time (seconds from audio start)
`e`	End time (seconds from audio start)
`speaker`	Speaker label `"0"`, `"1"`, … normalized in first-appearance order. `"0"` when diarization is disabled.

`topics` column

Hierarchical topics produced by the LLM, flattened to a list sorted by start_s:

[
  {"title": "Introduction", "level": 1, "start_s": 0.0,  "duration_s": 42.1, "number": "1",   "summary": "Speaker introduces the session goals."},
  {"title": "Setup",        "level": 2, "start_s": 5.2,  "duration_s": 15.0, "number": "1.1", "summary": "Environment prerequisites are reviewed."},
  {"title": "Config",       "level": 2, "start_s": 20.4, "duration_s": 21.7, "number": "1.2", "summary": "Key config values and their effects."}
]

Field	Description
`title`	LLM-generated topic label (≤120 chars)
`summary`	2–4 sentence description of what was discussed
`level`	Depth: 1 = top-level, 2 = subtopic, 3 = sub-subtopic
`start_s`	Offset from audio start where this topic begins (seconds)
`duration_s`	Duration; the last sibling at each level extends to the transcript end
`number`	Hierarchical label, e.g. `"1.2.3"`
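
Because the list is flat, consumers that want the hierarchy back can rebuild it from the number field. A short sketch, assuming every child's number extends its parent's with one more dotted component; build_tree is a hypothetical helper, and topics whose parent is absent are treated as roots.

```python
def build_tree(flat_topics):
    """Nest a flattened topic list using each entry's dotted `number`."""
    roots = []
    by_number = {}
    # Sorting by depth guarantees parents are registered before their children;
    # the sort is stable, so the original start_s order is kept within a depth.
    for topic in sorted(flat_topics, key=lambda t: t["number"].count(".")):
        node = {**topic, "children": []}
        by_number[topic["number"]] = node
        parent_number, _, _ = topic["number"].rpartition(".")
        parent = by_number.get(parent_number)
        (parent["children"] if parent else roots).append(node)
    return roots
```

Applied to the sample above, this yields one root ("Introduction") with "Setup" and "Config" as its children.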

Topic embeddings (Chroma)

Each topic's "<title>. <summary>" (or just title when there is no summary) is embedded with the Ollama embedding model (default: nomic-embed-text) and upserted into the audio_clip_topics Chroma collection. Each record carries the embedding plus metadata (clip_id, clip_title, topic_index, title, summary, level, start_s, duration_s, number); ids are <clip_id>:<topic_index> and topic_index refers to the position within the clip's filtered (title-bearing) topic list. Used by /viz/data and /viz/search.
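
The id and document conventions described above can be sketched in a few lines. This only illustrates how the text and ids are formed; topic_records is a hypothetical helper, and the actual embedding and Chroma upsert are left out.

```python
def topic_records(clip_id, topics):
    """Build (id, document) pairs for the title-bearing topics of one clip."""
    # topic_index counts positions in the filtered list, not the raw one.
    titled = [t for t in topics if t.get("title")]
    records = []
    for index, topic in enumerate(titled):
        summary = topic.get("summary")
        document = f"{topic['title']}. {summary}" if summary else topic["title"]
        records.append((f"{clip_id}:{index}", document))
    return records
```

Note that dropping an untitled topic shifts the index of every topic after it, which is why the index must be computed on the filtered list.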

Visualization

GET /viz serves an interactive 3-D scatter of all stored topics. The pipeline:

1. Embed: topic titles + summaries are embedded via Ollama. Stored embeddings are reused; missing ones are computed on demand.
2. Project: embeddings are projected to 3 dimensions via UMAP (metric=cosine, n_neighbors=15).
3. Cluster: topics are grouped with agglomerative clustering (metric=cosine, linkage=average, distance_threshold=0.35). The threshold is tunable via the distance_threshold query param (0.0–1.0).
4. Edges: for each topic, the top-5 cosine-similar neighbours above 0.60 similarity are wired as undirected edges in the scatter.
5. Render: the browser renders the 3-D scatter with Three.js. Points are coloured by cluster; edges are drawn as lines. Hovering shows the topic title, clip, and summary.
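
The edge-building step of the pipeline above can be sketched in pure Python. This is an illustration of the stated rule (top-5 neighbours, similarity above 0.60, undirected), not the project's topic_cluster.build_edges; sketch_edges is a hypothetical name, and the brute-force pairwise loop is only suitable for small collections.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sketch_edges(vectors, top_k=5, min_similarity=0.60):
    """Link each vector to its top_k most similar neighbours above the threshold."""
    edges = {}
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        scored.sort(reverse=True)
        for similarity, j in scored[:top_k]:
            if similarity > min_similarity:
                # Key on the sorted pair so (a, b) and (b, a) collapse to one edge.
                edges[(min(i, j), max(i, j))] = similarity
    return [{"a": a, "b": b, "similarity": s} for (a, b), s in sorted(edges.items())]
```

Storing each pair once keeps the edge list undirected even when two topics appear in each other's top-5.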

`/viz/data` response

{
  "points": [
    {
      "topic_title": "Introduction",
      "clip_id": "...",
      "clip_title": "Session 1",
      "level": 1,
      "start_s": 0.0,
      "duration_s": 42.1,
      "number": "1",
      "summary": "...",
      "x": 0.12,
      "y": -0.34,
      "z": 0.09,
      "cluster_id": 2
    }
  ],
  "edges": [{"a": 0, "b": 4, "similarity": 0.82}],
  "clip_ids": ["..."],
  "clip_titles": {"...": "Session 1"},
  "total_topics": 47,
  "total_clips": 3,
  "total_clusters": 8
}

`/viz/search` response

GET /viz/search?q=gradient+descent&top_k=5

[
  {
    "point_index": 12,
    "topic_title": "Backpropagation deep-dive",
    "clip_title": "ML Lecture 3",
    "clip_id": "...",
    "similarity": 0.91,
    "summary": "..."
  }
]
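
When calling the search endpoint from code, the query string should be URL-encoded rather than assembled by hand. A small sketch, assuming the default server address; search_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def search_url(query, top_k=10, base_url="http://127.0.0.1:8000"):
    """Build a /viz/search URL with the query string properly encoded."""
    return f"{base_url}/viz/search?{urlencode({'q': query, 'top_k': top_k})}"
```

For example, search_url("gradient descent", top_k=5) produces the same path and query string as the request shown above.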

Architecture

autorag generate-topics SOURCE
  │
  ├─ db.create_clip()              Register file in SQLite
  ├─ agent.transcribe()            Whisper + 5-stage LLM pipeline
  │    ├─ whisper_runner.get_model() → .transcribe_segment()
  │    ├─ Stage 2: L1 boundaries          (1 LLM call)
  │    ├─ Stage 3a: decide subdivide      (N LLM calls, batched)
  │    ├─ Stage 3b: L2 boundaries         (M LLM calls, batched)
  │    ├─ Stage 4: per-node summaries     (K LLM calls, batched)
  │    └─ Stage 5: L0 aggregate           (1 LLM call)
  ├─ _collapse_lone_children()     Drop single-child chains
  ├─ db.store_transcription()      Persist word spans
  ├─ _topics_to_events() → db.add_analytics_event() × N
  ├─ db.finalize_topics()          Compute durations, persist topics JSON
  └─ Embedder().embed_texts()      Ollama embed → ChromaStore.add_topic_embeddings()

autorag serve
  └─ FastAPI
       ├─ /ingest  POST → core.AutoRAG.ingest()
       ├─ /query   POST → core.AutoRAG.query()
       ├─ /viz          → static/viz/index.html  (Vite-built React + TS bundle)
       ├─ /viz-assets/* → static/viz/assets/*   (StaticFiles mount)
       ├─ /viz/data GET → viz.viz_data()
       │    ├─ db.list_clips()
       │    ├─ ChromaStore.get_clip_embeddings()  (per clip)
       │    ├─ Embedder().embed_texts()           (missing vecs only)
       │    ├─ viz.umap_3d()
       │    ├─ topic_cluster.cluster_embeddings()
       │    └─ topic_cluster.build_edges()
       └─ /viz/search GET → viz.viz_search()
            ├─ Embedder().embed_texts([q])
            └─ ChromaStore.query(query_vec, top_k)
