added architecture diagram in README by AmitKarnam · Pull Request #90 · Siddhant-K-code/distill

AmitKarnam · 2026-05-09T07:31:10Z

Added an architecture diagram in the README, which would help with understanding the product

AmitKarnam · 2026-05-09T07:32:32Z

@Siddhant-K-code was playing around with Distill and found it an interesting project. Felt like it missed a high-level architecture diagram, which would fasten the understanding of the product.

Siddhant-K-code · 2026-05-09T07:43:59Z

Hey @AmitKarnam, thanks for the PR!

I've just pushed couple of PRs & releases, so you might need to update your diagram!

AmitKarnam · 2026-05-09T09:53:23Z

@Siddhant-K-code have updated the architecture diagram. PTAL

AmitKarnam · 2026-05-12T18:00:39Z

@Siddhant-K-code PTAL

Siddhant-K-code · 2026-05-14T09:34:15Z

Diagram review

The core dedup flow is accurate — the node labels, field names, and logic all match the code. A few things worth fixing before merging.

❌ The write-time dedup arrow is wrong

The dashed red box ("Write-time dedupe / semantic dedup & conflict detection") has an arrow pointing back into the main pipeline. That connection doesn't exist. POST /v1/memory/store and POST /v1/dedupe are completely independent endpoints — they share no call path. The arrow implies a feedback loop that will mislead anyone trying to understand how memory relates to the dedup pipeline.

Fix: Remove the arrow. If you want to show the memory subsystem, draw it as a standalone box with a note that it's a separate endpoint.

⚠️ The "If Embedding's Present" bypass arrow is ambiguous

The dashed bypass goes from Partition to Cosine Distance, which is correct in intent (skip the embedding provider if embeddings are already on the chunks). But visually it lands mid-node, making it unclear whether Cosine Distance itself is also skipped.

Fix: Route the arrow to the input of Cosine Distance so it's clear only the Embeddings node is bypassed.

⚠️ The diagram only covers `/v1/dedupe`

The title says "architecture diagram" but this is the dedup pipeline only. Compress, summarize, cache, and the pipeline runner (POST /v1/pipeline) are all absent. A reader won't understand that dedup is one stage of four.

Fix: Either rename it to "Dedup pipeline" or add the other stages.

Minor

stats{input_count, output_count, ...} — the ... hides cluster_count, which is the primary signal of how much dedup occurred. Worth spelling out.
No legend for dashed vs solid arrows or the color coding (red/yellow/blue). Intuitive, but a one-line legend would help.

What's correct

Input fields {id, text, embedding?, score?, cache_control?} — exact match to DedupeChunk
Partition logic, cosine distance formula, clustering via contextlab, selection by score, MMR guard (target_k > 0), and the final merge (prepend prefix + append representatives) — all accurate
Output fields {id, text, score, cluster_id} — correct, and correctly omits embeddings (they're stripped from the response)

Siddhant-K-code · 2026-05-24T15:32:01Z


 **Result:** Deterministic, diverse context. No LLM calls. Fully auditable.

+### Dedup pipeline architecture diagram


Remove this heading — it duplicates the section it's inside.

The parent section is already ### Dedup pipeline. Adding another heading called ### Dedup pipeline architecture diagram immediately inside it is redundant.

Pick one:

Dedup-only diagram → drop the heading entirely, the image is self-explanatory.

Full system architecture → move the image above ## Installation and give it its own ### Architecture section.

Siddhant-K-code · 2026-05-24T15:32:05Z


+### Dedup pipeline architecture diagram
+![Dedup pipeline architecture diagram](docs/images/architecture-diagram.png)
+


Extra blank line — one is enough here. Delete this line.

Siddhant-K-code · 2026-05-24T15:32:21Z

Thanks for adding the diagram — visuals help here. Two things to sort out before this merges:

1. What does the diagram actually show?
The PR title says "architecture diagram" (full system), but it's placed inside the ### Dedup pipeline subsection. These are different things. Please clarify in the PR description what the diagram covers, then place it accordingly:

Full system → move above ## Installation, add a ### Architecture section.
Dedup pipeline only → keep it here, update the PR title.

2. The heading is redundant (see inline comment on line 63).

Once those two are resolved this should be good to go.

Siddhant-K-code · 2026-05-24T15:38:26Z

Diagram accuracy review — checked against the implementation.

The overall flow is correct and the diagram is a useful addition. A few inaccuracies to fix:

❌ Input fields — cache_control? should be cache_control

The diagram shows cache_control? as if it's an optional field name. The actual field is CacheControl string — it's always present in the struct, just empty when unused. The ? notation is misleading. Use cache_control (no ?), or annotate it as // optional in a legend.

❌ Selection node — "by score" is hardcoded, not a general rule

The diagram says "pick representative per cluster (by score)". This is accurate for the current HTTP handler (it hardcodes SelectByScore), but the selector package supports four strategies: score, centroid, length, and hybrid. If the diagram is meant to document the pipeline as-is, add "(default)" to make clear this isn't the only option. If it's meant to document the full system, show the strategy as a configurable input.

❌ MMR node — trigger condition is wrong

The diagram says "Optional MMR — if target_k set". The actual trigger is:

target_k > 0  AND  len(representatives) > target_k

If target_k is set but the number of representatives is already ≤ target_k, MMR is skipped. The diagram implies MMR always runs when target_k is provided, which isn't true.

Also: lambda controls diversity (not just "lambda controls diversity" in isolation) — the formula is λ × relevance − (1−λ) × max_similarity_to_selected. Worth adding a brief note so readers understand the direction: λ=1 = pure relevance, λ=0 = pure diversity.

❌ Output — missing fields

The output box shows stats{input_count, output_count, cluster_count ...} — the ... hides fields that are worth calling out explicitly, especially reduction_pct and the cache-prefix stats (cache_prefix_frozen, suffix_input_count, suffix_output_count). These are the fields most consumers will actually use.

Also, chunks[{id, text, score, cluster_id}] is correct ✅ — but worth noting that cluster_id = -1 for frozen prefix chunks.

✅ Things that are correct

Partition logic (frozen prefix + dedup suffix) ✅
Embedding fallback (use provided or call provider) ✅
Providers listed (OpenAI / Ollama / Cohere) ✅
Cosine distance formula (distance = 1 − cosine_similarity) ✅
Clustering description (pairwise distances → merge when dist ≤ threshold) ✅
Merge step (prepend frozen prefix + append representatives) ✅
Dashed bypass line for when embeddings are present ✅

added architecture diagram in README

81f58a1

Updated the architecture diagram with wrtie-time dedupe

e06763e

Siddhant-K-code requested changes May 10, 2026

View reviewed changes

Comment thread README.md Outdated

Siddhant-K-code reviewed May 10, 2026

View reviewed changes

Comment thread docs/images/architecture-diagram.png

docs: update architecture diagram and improve alt text

a171892

AmitKarnam added 3 commits May 15, 2026 17:48

docs: update architecture diagram title for clarity

37a44c8

docs: update architecture diagram header for clarity

b753ece

docs: update architecture diagram image

91c92a5

Siddhant-K-code reviewed May 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

added architecture diagram in README#90

added architecture diagram in README#90
AmitKarnam wants to merge 6 commits into
Siddhant-K-code:mainfrom
AmitKarnam:docs/add-architecture-diagram

AmitKarnam commented May 9, 2026

Uh oh!

AmitKarnam commented May 9, 2026

Uh oh!

Siddhant-K-code commented May 9, 2026

Uh oh!

AmitKarnam commented May 9, 2026

Uh oh!

Uh oh!

Uh oh!

AmitKarnam commented May 12, 2026

Uh oh!

Siddhant-K-code commented May 14, 2026

Uh oh!

Siddhant-K-code May 24, 2026

Uh oh!

Siddhant-K-code May 24, 2026

Uh oh!

Siddhant-K-code commented May 24, 2026

Uh oh!

Siddhant-K-code commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		Result: Deterministic, diverse context. No LLM calls. Fully auditable.

		### Dedup pipeline architecture diagram


		### Dedup pipeline architecture diagram
		![Dedup pipeline architecture diagram](docs/images/architecture-diagram.png)

Uh oh!

Conversation

AmitKarnam commented May 9, 2026

Uh oh!

AmitKarnam commented May 9, 2026

Uh oh!

Siddhant-K-code commented May 9, 2026

Uh oh!

AmitKarnam commented May 9, 2026

Uh oh!

Uh oh!

Uh oh!

AmitKarnam commented May 12, 2026

Uh oh!

Siddhant-K-code commented May 14, 2026

Diagram review

❌ The write-time dedup arrow is wrong

⚠️ The "If Embedding's Present" bypass arrow is ambiguous

⚠️ The diagram only covers /v1/dedupe

Minor

What's correct

Uh oh!

Siddhant-K-code May 24, 2026

Choose a reason for hiding this comment

Uh oh!

Siddhant-K-code May 24, 2026

Choose a reason for hiding this comment

Uh oh!

Siddhant-K-code commented May 24, 2026

Uh oh!

Siddhant-K-code commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

⚠️ The diagram only covers `/v1/dedupe`