Skip to content

added architecture diagram in README#90

Open
AmitKarnam wants to merge 6 commits into
Siddhant-K-code:mainfrom
AmitKarnam:docs/add-architecture-diagram
Open

added architecture diagram in README#90
AmitKarnam wants to merge 6 commits into
Siddhant-K-code:mainfrom
AmitKarnam:docs/add-architecture-diagram

Conversation

@AmitKarnam

Copy link
Copy Markdown

Added an architecture diagram in the README, which would help with understanding the product

@AmitKarnam

Copy link
Copy Markdown
Author

@Siddhant-K-code was playing around with Distill and found it an interesting project. Felt like it missed a high-level architecture diagram, which would fasten the understanding of the product.

@Siddhant-K-code

Copy link
Copy Markdown
Owner

Hey @AmitKarnam, thanks for the PR!

I've just pushed couple of PRs & releases, so you might need to update your diagram!

@AmitKarnam

Copy link
Copy Markdown
Author

@Siddhant-K-code have updated the architecture diagram. PTAL

Comment thread README.md Outdated
Comment thread docs/images/architecture-diagram.png
@AmitKarnam

Copy link
Copy Markdown
Author

@Siddhant-K-code PTAL

@Siddhant-K-code

Copy link
Copy Markdown
Owner

Diagram review

The core dedup flow is accurate — the node labels, field names, and logic all match the code. A few things worth fixing before merging.


❌ The write-time dedup arrow is wrong

The dashed red box ("Write-time dedupe / semantic dedup & conflict detection") has an arrow pointing back into the main pipeline. That connection doesn't exist. POST /v1/memory/store and POST /v1/dedupe are completely independent endpoints — they share no call path. The arrow implies a feedback loop that will mislead anyone trying to understand how memory relates to the dedup pipeline.

Fix: Remove the arrow. If you want to show the memory subsystem, draw it as a standalone box with a note that it's a separate endpoint.


⚠️ The "If Embedding's Present" bypass arrow is ambiguous

The dashed bypass goes from Partition to Cosine Distance, which is correct in intent (skip the embedding provider if embeddings are already on the chunks). But visually it lands mid-node, making it unclear whether Cosine Distance itself is also skipped.

Fix: Route the arrow to the input of Cosine Distance so it's clear only the Embeddings node is bypassed.


⚠️ The diagram only covers /v1/dedupe

The title says "architecture diagram" but this is the dedup pipeline only. Compress, summarize, cache, and the pipeline runner (POST /v1/pipeline) are all absent. A reader won't understand that dedup is one stage of four.

Fix: Either rename it to "Dedup pipeline" or add the other stages.


Minor

  • stats{input_count, output_count, ...} — the ... hides cluster_count, which is the primary signal of how much dedup occurred. Worth spelling out.
  • No legend for dashed vs solid arrows or the color coding (red/yellow/blue). Intuitive, but a one-line legend would help.

What's correct

  • Input fields {id, text, embedding?, score?, cache_control?} — exact match to DedupeChunk
  • Partition logic, cosine distance formula, clustering via contextlab, selection by score, MMR guard (target_k > 0), and the final merge (prepend prefix + append representatives) — all accurate
  • Output fields {id, text, score, cluster_id} — correct, and correctly omits embeddings (they're stripped from the response)

Comment thread README.md

**Result:** Deterministic, diverse context. No LLM calls. Fully auditable.

### Dedup pipeline architecture diagram

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this heading — it duplicates the section it's inside.

The parent section is already ### Dedup pipeline. Adding another heading called ### Dedup pipeline architecture diagram immediately inside it is redundant.

Pick one:

  • Dedup-only diagram → drop the heading entirely, the image is self-explanatory.
  • Full system architecture → move the image above ## Installation and give it its own ### Architecture section.

Comment thread README.md

### Dedup pipeline architecture diagram
![Dedup pipeline architecture diagram](docs/images/architecture-diagram.png)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra blank line — one is enough here. Delete this line.

@Siddhant-K-code

Copy link
Copy Markdown
Owner

Thanks for adding the diagram — visuals help here. Two things to sort out before this merges:

1. What does the diagram actually show?
The PR title says "architecture diagram" (full system), but it's placed inside the ### Dedup pipeline subsection. These are different things. Please clarify in the PR description what the diagram covers, then place it accordingly:

  • Full system → move above ## Installation, add a ### Architecture section.
  • Dedup pipeline only → keep it here, update the PR title.

2. The heading is redundant (see inline comment on line 63).

Once those two are resolved this should be good to go.

@Siddhant-K-code

Copy link
Copy Markdown
Owner

Diagram accuracy review — checked against the implementation.

The overall flow is correct and the diagram is a useful addition. A few inaccuracies to fix:


❌ Input fields — cache_control? should be cache_control

The diagram shows cache_control? as if it's an optional field name. The actual field is CacheControl string — it's always present in the struct, just empty when unused. The ? notation is misleading. Use cache_control (no ?), or annotate it as // optional in a legend.


❌ Selection node — "by score" is hardcoded, not a general rule

The diagram says "pick representative per cluster (by score)". This is accurate for the current HTTP handler (it hardcodes SelectByScore), but the selector package supports four strategies: score, centroid, length, and hybrid. If the diagram is meant to document the pipeline as-is, add "(default)" to make clear this isn't the only option. If it's meant to document the full system, show the strategy as a configurable input.


❌ MMR node — trigger condition is wrong

The diagram says "Optional MMR — if target_k set". The actual trigger is:

target_k > 0  AND  len(representatives) > target_k

If target_k is set but the number of representatives is already ≤ target_k, MMR is skipped. The diagram implies MMR always runs when target_k is provided, which isn't true.

Also: lambda controls diversity (not just "lambda controls diversity" in isolation) — the formula is λ × relevance − (1−λ) × max_similarity_to_selected. Worth adding a brief note so readers understand the direction: λ=1 = pure relevance, λ=0 = pure diversity.


❌ Output — missing fields

The output box shows stats{input_count, output_count, cluster_count ...} — the ... hides fields that are worth calling out explicitly, especially reduction_pct and the cache-prefix stats (cache_prefix_frozen, suffix_input_count, suffix_output_count). These are the fields most consumers will actually use.

Also, chunks[{id, text, score, cluster_id}] is correct ✅ — but worth noting that cluster_id = -1 for frozen prefix chunks.


✅ Things that are correct

  • Partition logic (frozen prefix + dedup suffix) ✅
  • Embedding fallback (use provided or call provider) ✅
  • Providers listed (OpenAI / Ollama / Cohere) ✅
  • Cosine distance formula (distance = 1 − cosine_similarity) ✅
  • Clustering description (pairwise distances → merge when dist ≤ threshold) ✅
  • Merge step (prepend frozen prefix + append representatives) ✅
  • Dashed bypass line for when embeddings are present ✅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants