Skip to content

[ENH] Group chunk siblings into one compaction partition#7133

Merged
LLay merged 3 commits into
mainfrom
foundation_chunk_aware_partitioning
May 28, 2026
Merged

[ENH] Group chunk siblings into one compaction partition#7133
LLay merged 3 commits into
mainfrom
foundation_chunk_aware_partitioning

Conversation

@LLay
Copy link
Copy Markdown
Contributor

@LLay LLay commented May 26, 2026

Summary

PartitionOperator::partition (rust/worker/src/execution/operators/partition_log.rs) groups log records by record.id so all operations on one id land in a single partition. The operator guarantees in-order processing within a partition, but not across partitions.

Chunks of one document are distinct ids ({base}-0, {base}-1, …), so today they can be split across partitions and materialized out of order relative to each other.

This PR adds an opt-in mode (chunk_grouping_key) that folds a trailing -{digits} chunk suffix onto its base, so a document's chunks share a partition and preserve per-document WAL order.

Gating (per review decision)

The folding is off by default and enabled per collection via a new collection-metadata flag:

  • GROUP_CHUNK_SIBLINGS_METADATA_KEY = "chroma:group_chunk_siblings". A MetadataValue::Bool(true) on the collection's metadata turns it on.
  • PartitionInput carries a group_chunk_siblings: bool; the log-fetch orchestrator reads the flag from the collection metadata and threads it in.
  • The code that sets this flag on foundation source collections lands in a separate PR. Until that ships, no collection carries the flag, so this PR is a no-op for every existing collection — behavior is byte-identical to before.

This replaces the original global approach in this PR's first revision. Gating avoids the parallelism cost for collections whose ids legitimately end in -{digits} (e.g. report-2024) — they only get co-located if the collection opts in.

Why

The foundation sync-status work (ADR 0001 §6, chroma-core/foundation) needs an end-of-job marker: the sync pipeline writes a trailing Mutation::Update to {base}-0 after every chunk Create, and the foundation attached function treats observing job_id on that record as "this upload is fully synthesized." That only works if the marker is observed after all sibling chunks — i.e. if the chunks process in WAL order, which requires them in the same partition. Foundation source collections will set the metadata flag; everything else is unaffected.

Correctness

The folding only ever enlarges groups: a concrete id always maps to a single key, so the existing "all ops for one id in one partition" invariant is preserved. No same-id group is ever split. For opted-in collections, unrelated ids ending in -{digits} are co-located (correctness-neutral; they remain distinct records).

Test plan

  • Unit tests added:
    • test_chunk_grouping_key — fold / passthrough / edge cases (incl. the report-2024 fold).
    • test_chunk_siblings_share_a_partition_when_opted_in — flag on: 3 chunks + 1 unrelated record at partition_size = 1 → 2 partitions.
    • test_chunk_siblings_not_grouped_when_opted_out — flag off (default): same input → 4 single-record partitions (unchanged behavior).
  • CI must run these — I could not run the worker test suite locally: this environment's rustc (Homebrew 1.94.1) differs from the repo-pinned 1.92.0 and wal3 fails to compile under 1.94.1 (independently of this change). cargo fmt is clean; the edit is small and self-contained, but it has not been compiled locally.

🤖 Generated with Claude Code

The PartitionOperator groups log records by record id so that all
operations on one id land in a single partition (it guarantees
in-order processing within a partition, but not across partitions).
Chunks of one document are distinct ids (`{base}-0`, `{base}-1`, ...),
so today they can be split across partitions and processed out of
order relative to each other.

This changes the grouping key to fold a trailing `-{digits}` chunk
suffix onto its base, so a document's chunks share a partition and
preserve per-document WAL order. That lets a downstream consumer — the
foundation attached function — rely on observing a trailing end-of-job
marker written to `{base}-0` only after every sibling chunk (ADR 0001
§6 in chroma-core/foundation).

The change only ever enlarges groups: a concrete id always maps to a
single key, so the existing "all ops for one id in one partition"
invariant holds. It is correctness-neutral for unrelated ids that
happen to end in `-{digits}` (they stay distinct records), but folds
them into a shared partition, which can reduce compaction parallelism
for collections that use a prefix-then-number id scheme.

DRAFT pending worker/sysdb review of that trade-off — see PR body.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@github-actions
Copy link
Copy Markdown

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Per review, the chunk-sibling folding is now opt-in per collection
rather than global:

- New PartitionInput.group_chunk_siblings flag; partition() only folds
  `{base}-{idx}` onto a shared base when it's set, otherwise groups by
  the exact id (unchanged behavior).
- New GROUP_CHUNK_SIBLINGS_METADATA_KEY ("chroma:group_chunk_siblings")
  collection-metadata key. The log-fetch orchestrator reads it from the
  collection's metadata and passes the resulting bool into
  PartitionInput.
- The code that *sets* this flag on foundation source collections lands
  in a separate PR; until then no collection carries it and behavior is
  identical to before.

Tests: gated-on (siblings share a partition) and gated-off (siblings
stay distinct) cases, plus the chunk_grouping_key unit tests.
@LLay LLay marked this pull request as ready for review May 26, 2026 23:14
@LLay LLay requested review from HammadB and rescrv May 27, 2026 00:23
/// collections (foundation source collections) lands in a separate PR.
/// Until that ships, no collection carries the flag and this operator
/// behaves exactly as before.
pub const GROUP_CHUNK_SIBLINGS_METADATA_KEY: &str = "chroma:group_chunk_siblings";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be duplicated, we can reimport it from where it is defined in the other PR that introduces this into foundation

Copy link
Copy Markdown
Contributor Author

@LLay LLay May 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the PR that introduces it.

But to tidy things: moved the canonical definition into chroma_types as CHROMA_GROUP_CHUNK_SIBLINGS_KEY (next to the other chroma: metadata-key constants in api_types.rs) and replaced the local const here with a re-export

Move CHROMA_GROUP_CHUNK_SIBLINGS_KEY into chroma_types (next to the other
chroma: metadata-key constants) and re-export it from partition_log as
GROUP_CHUNK_SIBLINGS_METADATA_KEY, instead of duplicating the string
literal in the worker. This gives the reader (worker PartitionOperator)
and the writer (foundation /init) a single shared definition.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
LLay added a commit that referenced this pull request May 27, 2026
Foundation's /init endpoint now also ensures the source collections
(slack, notion — configurable via CHROMA_FOUNDATION__SOURCE_COLLECTIONS)
and sets the `chroma:group_chunk_siblings` metadata flag on each. That
flag opts the collection into chunk-sibling grouping in the worker's
PartitionOperator, so a job's chunk records stay in one partition and
the trailing end-of-job marker on `{base}-0` is observed after every
sibling chunk (ADR 0001 §6 in chroma-core/foundation).

- Promote the flag key to a shared constant
  `chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY` (next to the existing
  CHROMA_* metadata keys) so the reader (worker) and writer
  (foundation-api) share one definition. partition_log.rs now re-exports
  it.
- foundation-api: new `source_collections` config (default
  ["slack","notion"]); `ensure_collection` takes optional metadata;
  /init creates source collections with the flag and returns their ids.
  Wiki collections are the function's *output* and intentionally do NOT
  get the flag.

Stacked on the partition-operator change (chroma #7133), which reads
this flag. DRAFT — see PR body for the get-or-create idempotency
caveat.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
/// The flag is *read* here; the code that *sets* it on foundation source
/// collections lives in the foundation `/init` endpoint. Re-exported from
/// `chroma_types` so reader and writer share one definition.
pub use chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY as GROUP_CHUNK_SIBLINGS_METADATA_KEY;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you really need this reexport?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose not! On my CLN TODO list

@LLay LLay merged commit cc25f33 into main May 28, 2026
115 of 121 checks passed
LLay added a commit that referenced this pull request May 28, 2026
…7134)

> **Draft.** Stacked on #7133 (the worker change that *reads* the
chunk-sibling flag). Merge #7133 first, then rebase this onto main.

## Summary

Extends the foundation `/init` endpoint to mirror the CLI POC
(chroma-core/foundation #97), so `/init` is the single bootstrap for a
team's foundation workspace. On top of the existing wiki +
wiki_revisions creation, `/init` now:

1. **Ensures the source collections** (slack, notion; configurable via
`CHROMA_FOUNDATION__SOURCE_COLLECTIONS`).
2. **Sets `chroma:group_chunk_siblings = true`** on each source
collection so the worker's `PartitionOperator` (#7133) keeps a job's
chunk records in one partition — the ordering the end-of-job marker
relies on (ADR 0001 §6).
3. **Attaches the foundation function** to each source collection via
`SysDb::create_attached_function`, with the wiki collection as output —
the server-side equivalent of the POC's HTTP attach.

## Function attach (mirrors POC #97)

- Attachment name `{source}_to_wiki`; operator `http_generate`
(configurable).
- `params`: `{ endpoint_url, source_collection, source_kind }` —
`endpoint_url` defaults to the modal URL from the POC;
`source_collection`/`source_kind` are the source name.
- `min_records_for_invocation` defaults to 100 (matches the chroma
frontend default).
- **No `seed_output_collection` step** — per @HammadB, the output
dimension is already hardcoded to 1024 in `/init`'s collection creation
(chroma #7127, already on main).
- **Idempotent**: `AlreadyExists` / `CollectionAlreadyHasFunction` are
treated as success, so `/init` stays safe to call repeatedly.

## Shared constant

The chunk-sibling flag key is promoted to
`chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY` so the reader (worker
`PartitionOperator`) and writer (`/init`) share one definition;
`partition_log.rs` re-exports it.

## Wiki collections deliberately untouched

Wiki/wiki_revisions are the function's *output* — no chunk-sibling flag,
no attach. The marker mechanism operates on the source/input side.

## Caveat: get-or-create idempotency

`/init` uses get-or-create for collections. If a source collection
**already exists** without the flag (e.g. created by an earlier upload),
the metadata isn't retroactively updated. `/init` must run before the
first upload (it's the bootstrap). The function attach is independently
idempotent. Pre-existing source collections would need a one-off
metadata backfill — out of scope here.

## Test plan

- [x] `cargo check -p foundation-api`, `cargo check -p chroma-types`
pass.
- [x] `cargo test -p foundation-api --lib routes::init` — 4/4 pass.
- [ ] **CI must run the worker suite** — the `partition_log.rs`
re-export couldn't be compiled locally (Homebrew rustc 1.94.1 vs pinned
1.92.0; `wal3` fails under 1.94.1 independent of this change).
- [ ] End-to-end (post-#7133-merge): `/init` → source collections carry
the flag + have `http_generate` attached → uploads chunk into them →
attached function runs and observes the end-of-job marker last.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants