[ENH] Group chunk siblings into one compaction partition#7133
Merged
Conversation
The PartitionOperator groups log records by record id so that all
operations on one id land in a single partition (it guarantees
in-order processing within a partition, but not across partitions).
Chunks of one document are distinct ids (`{base}-0`, `{base}-1`, ...),
so today they can be split across partitions and processed out of
order relative to each other.
This changes the grouping key to fold a trailing `-{digits}` chunk
suffix onto its base, so a document's chunks share a partition and
preserve per-document WAL order. That lets a downstream consumer — the
foundation attached function — rely on observing a trailing end-of-job
marker written to `{base}-0` only after every sibling chunk (ADR 0001
§6 in chroma-core/foundation).
The change only ever enlarges groups: a concrete id always maps to a
single key, so the existing "all ops for one id in one partition"
invariant holds. It is correctness-neutral for unrelated ids that
happen to end in `-{digits}` (they stay distinct records), but folds
them into a shared partition, which can reduce compaction parallelism
for collections that use a prefix-then-number id scheme.
DRAFT pending worker/sysdb review of that trade-off — see PR body.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Per review, the chunk-sibling folding is now opt-in per collection
rather than global:
- New PartitionInput.group_chunk_siblings flag; partition() only folds
`{base}-{idx}` onto a shared base when it's set, otherwise groups by
the exact id (unchanged behavior).
- New GROUP_CHUNK_SIBLINGS_METADATA_KEY ("chroma:group_chunk_siblings")
collection-metadata key. The log-fetch orchestrator reads it from the
collection's metadata and passes the resulting bool into
PartitionInput.
- The code that *sets* this flag on foundation source collections lands
in a separate PR; until then no collection carries it and behavior is
identical to before.
Tests: gated-on (siblings share a partition) and gated-off (siblings
stay distinct) cases, plus the chunk_grouping_key unit tests.
4 tasks
HammadB
reviewed
May 27, 2026
| /// collections (foundation source collections) lands in a separate PR. | ||
| /// Until that ships, no collection carries the flag and this operator | ||
| /// behaves exactly as before. | ||
| pub const GROUP_CHUNK_SIBLINGS_METADATA_KEY: &str = "chroma:group_chunk_siblings"; |
Collaborator
There was a problem hiding this comment.
this should not be duplicated, we can reimport it from where it is defined in the other PR that introduces this into foundation
Contributor
Author
There was a problem hiding this comment.
This is the PR that introduces it.
But to tidy things: moved the canonical definition into chroma_types as CHROMA_GROUP_CHUNK_SIBLINGS_KEY (next to the other chroma: metadata-key constants in api_types.rs) and replaced the local const here with a re-export
Move CHROMA_GROUP_CHUNK_SIBLINGS_KEY into chroma_types (next to the other chroma: metadata-key constants) and re-export it from partition_log as GROUP_CHUNK_SIBLINGS_METADATA_KEY, instead of duplicating the string literal in the worker. This gives the reader (worker PartitionOperator) and the writer (foundation /init) a single shared definition. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
LLay
added a commit
that referenced
this pull request
May 27, 2026
Foundation's /init endpoint now also ensures the source collections
(slack, notion — configurable via CHROMA_FOUNDATION__SOURCE_COLLECTIONS)
and sets the `chroma:group_chunk_siblings` metadata flag on each. That
flag opts the collection into chunk-sibling grouping in the worker's
PartitionOperator, so a job's chunk records stay in one partition and
the trailing end-of-job marker on `{base}-0` is observed after every
sibling chunk (ADR 0001 §6 in chroma-core/foundation).
- Promote the flag key to a shared constant
`chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY` (next to the existing
CHROMA_* metadata keys) so the reader (worker) and writer
(foundation-api) share one definition. partition_log.rs now re-exports
it.
- foundation-api: new `source_collections` config (default
["slack","notion"]); `ensure_collection` takes optional metadata;
/init creates source collections with the flag and returns their ids.
Wiki collections are the function's *output* and intentionally do NOT
get the flag.
Stacked on the partition-operator change (chroma #7133), which reads
this flag. DRAFT — see PR body for the get-or-create idempotency
caveat.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
HammadB
reviewed
May 28, 2026
| /// The flag is *read* here; the code that *sets* it on foundation source | ||
| /// collections lives in the foundation `/init` endpoint. Re-exported from | ||
| /// `chroma_types` so reader and writer share one definition. | ||
| pub use chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY as GROUP_CHUNK_SIBLINGS_METADATA_KEY; |
Collaborator
There was a problem hiding this comment.
do you really need this reexport?
Contributor
Author
There was a problem hiding this comment.
I suppose not! On my CLN TODO list
HammadB
approved these changes
May 28, 2026
LLay
added a commit
that referenced
this pull request
May 28, 2026
…7134) > **Draft.** Stacked on #7133 (the worker change that *reads* the chunk-sibling flag). Merge #7133 first, then rebase this onto main. ## Summary Extends the foundation `/init` endpoint to mirror the CLI POC (chroma-core/foundation #97), so `/init` is the single bootstrap for a team's foundation workspace. On top of the existing wiki + wiki_revisions creation, `/init` now: 1. **Ensures the source collections** (slack, notion; configurable via `CHROMA_FOUNDATION__SOURCE_COLLECTIONS`). 2. **Sets `chroma:group_chunk_siblings = true`** on each source collection so the worker's `PartitionOperator` (#7133) keeps a job's chunk records in one partition — the ordering the end-of-job marker relies on (ADR 0001 §6). 3. **Attaches the foundation function** to each source collection via `SysDb::create_attached_function`, with the wiki collection as output — the server-side equivalent of the POC's HTTP attach. ## Function attach (mirrors POC #97) - Attachment name `{source}_to_wiki`; operator `http_generate` (configurable). - `params`: `{ endpoint_url, source_collection, source_kind }` — `endpoint_url` defaults to the modal URL from the POC; `source_collection`/`source_kind` are the source name. - `min_records_for_invocation` defaults to 100 (matches the chroma frontend default). - **No `seed_output_collection` step** — per @HammadB, the output dimension is already hardcoded to 1024 in `/init`'s collection creation (chroma #7127, already on main). - **Idempotent**: `AlreadyExists` / `CollectionAlreadyHasFunction` are treated as success, so `/init` stays safe to call repeatedly. ## Shared constant The chunk-sibling flag key is promoted to `chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY` so the reader (worker `PartitionOperator`) and writer (`/init`) share one definition; `partition_log.rs` re-exports it. ## Wiki collections deliberately untouched Wiki/wiki_revisions are the function's *output* — no chunk-sibling flag, no attach. The marker mechanism operates on the source/input side. ## Caveat: get-or-create idempotency `/init` uses get-or-create for collections. If a source collection **already exists** without the flag (e.g. created by an earlier upload), the metadata isn't retroactively updated. `/init` must run before the first upload (it's the bootstrap). The function attach is independently idempotent. Pre-existing source collections would need a one-off metadata backfill — out of scope here. ## Test plan - [x] `cargo check -p foundation-api`, `cargo check -p chroma-types` pass. - [x] `cargo test -p foundation-api --lib routes::init` — 4/4 pass. - [ ] **CI must run the worker suite** — the `partition_log.rs` re-export couldn't be compiled locally (Homebrew rustc 1.94.1 vs pinned 1.92.0; `wal3` fails under 1.94.1 independent of this change). - [ ] End-to-end (post-#7133-merge): `/init` → source collections carry the flag + have `http_generate` attached → uploads chunk into them → attached function runs and observes the end-of-job marker last. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PartitionOperator::partition(rust/worker/src/execution/operators/partition_log.rs) groups log records byrecord.idso all operations on one id land in a single partition. The operator guarantees in-order processing within a partition, but not across partitions.Chunks of one document are distinct ids (
{base}-0,{base}-1, …), so today they can be split across partitions and materialized out of order relative to each other.This PR adds an opt-in mode (
chunk_grouping_key) that folds a trailing-{digits}chunk suffix onto its base, so a document's chunks share a partition and preserve per-document WAL order.Gating (per review decision)
The folding is off by default and enabled per collection via a new collection-metadata flag:
GROUP_CHUNK_SIBLINGS_METADATA_KEY="chroma:group_chunk_siblings". AMetadataValue::Bool(true)on the collection's metadata turns it on.PartitionInputcarries agroup_chunk_siblings: bool; the log-fetch orchestrator reads the flag from the collection metadata and threads it in.This replaces the original global approach in this PR's first revision. Gating avoids the parallelism cost for collections whose ids legitimately end in
-{digits}(e.g.report-2024) — they only get co-located if the collection opts in.Why
The foundation sync-status work (ADR 0001 §6, chroma-core/foundation) needs an end-of-job marker: the sync pipeline writes a trailing
Mutation::Updateto{base}-0after every chunk Create, and the foundation attached function treats observingjob_idon that record as "this upload is fully synthesized." That only works if the marker is observed after all sibling chunks — i.e. if the chunks process in WAL order, which requires them in the same partition. Foundation source collections will set the metadata flag; everything else is unaffected.Correctness
The folding only ever enlarges groups: a concrete id always maps to a single key, so the existing "all ops for one id in one partition" invariant is preserved. No same-id group is ever split. For opted-in collections, unrelated ids ending in
-{digits}are co-located (correctness-neutral; they remain distinct records).Test plan
test_chunk_grouping_key— fold / passthrough / edge cases (incl. thereport-2024fold).test_chunk_siblings_share_a_partition_when_opted_in— flag on: 3 chunks + 1 unrelated record atpartition_size = 1→ 2 partitions.test_chunk_siblings_not_grouped_when_opted_out— flag off (default): same input → 4 single-record partitions (unchanged behavior).wal3fails to compile under 1.94.1 (independently of this change).cargo fmtis clean; the edit is small and self-contained, but it has not been compiled locally.🤖 Generated with Claude Code