[ENH] /init: create source collections + attach foundation function #7134

Open
LLay wants to merge 3 commits into foundation_chunk_aware_partitioning from foundation_init_source_collections

Conversation

LLay (Contributor) commented May 26, 2026

Draft. Stacked on #7133 (the worker change that reads the chunk-sibling flag). Merge #7133 first, then rebase this onto main.

Summary

Extends the foundation /init endpoint to mirror the CLI POC (chroma-core/foundation #97), so /init is the single bootstrap for a team's foundation workspace. On top of the existing wiki + wiki_revisions creation, /init now:

  1. Ensures the source collections (slack, notion; configurable via CHROMA_FOUNDATION__SOURCE_COLLECTIONS).
  2. Sets chroma:group_chunk_siblings = true on each source collection so that the worker's PartitionOperator (#7133, "[ENH] Group chunk siblings into one compaction partition") keeps a job's chunk records in one partition, which is the ordering the end-of-job marker relies on (ADR 0001 §6).
  3. Attaches the foundation function to each source collection via SysDb::create_attached_function, with the wiki collection as output — the server-side equivalent of the POC's HTTP attach.
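
The three steps above can be sketched against an in-memory stand-in. This is a minimal illustration, not the PR's code: `FakeSysDb`, `attach_function`, `init_workspace`, and the `HashMap<String, bool>` metadata type are hypothetical simplifications, and the real attach goes through `SysDb::create_attached_function`, whose signature is not shown in this PR description.

```rust
use std::collections::HashMap;

/// Metadata key named in the PR description.
const CHROMA_GROUP_CHUNK_SIBLINGS_KEY: &str = "chroma:group_chunk_siblings";

/// Hypothetical in-memory stand-in for the system database.
#[derive(Default)]
struct FakeSysDb {
    /// Collection name -> metadata (simplified to boolean values).
    collections: HashMap<String, HashMap<String, bool>>,
    /// (input collection, attachment name, output collection).
    attachments: Vec<(String, String, String)>,
}

impl FakeSysDb {
    /// Get-or-create: an existing collection is left untouched.
    fn ensure_collection(&mut self, name: &str, metadata: Option<HashMap<String, bool>>) {
        self.collections
            .entry(name.to_string())
            .or_insert_with(|| metadata.unwrap_or_default());
    }

    /// Hypothetical attach: a repeated call with the same arguments is a no-op.
    fn attach_function(&mut self, input: &str, name: &str, output: &str) {
        let key = (input.to_string(), name.to_string(), output.to_string());
        if !self.attachments.contains(&key) {
            self.attachments.push(key);
        }
    }
}

/// Sketch of the /init flow described in the summary.
fn init_workspace(db: &mut FakeSysDb, sources: &[&str]) {
    // Existing behaviour: the output collections, created without the flag.
    db.ensure_collection("wiki", None);
    db.ensure_collection("wiki_revisions", None);

    for &source in sources {
        // Steps 1 and 2: ensure the source collection and set the flag on creation.
        let mut metadata = HashMap::new();
        metadata.insert(CHROMA_GROUP_CHUNK_SIBLINGS_KEY.to_string(), true);
        db.ensure_collection(source, Some(metadata));

        // Step 3: attach the function with the wiki collection as output.
        db.attach_function(source, &format!("{source}_to_wiki"), "wiki");
    }
}
```

Calling `init_workspace` twice with the same sources leaves the stand-in in the same state, which is the property /init is meant to have.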

Function attach (mirrors POC #97)

  • Attachment name {source}_to_wiki; operator http_generate (configurable).
  • params: { endpoint_url, source_collection, source_kind }. endpoint_url defaults to the modal URL from the POC; source_collection and source_kind are both set to the source name.
  • min_records_for_invocation defaults to 100 (matches the chroma frontend default).
  • No seed_output_collection step: per @HammadB, the output dimension is already hardcoded to 1024 in /init's collection creation (chroma #7127, "[CHORE] Make foundation api /init use correct schema, index, dim", already on main).
  • Idempotent: AlreadyExists / CollectionAlreadyHasFunction are treated as success, so /init stays safe to call repeatedly.
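
A rough sketch of the attach parameters and the idempotency rule listed above. `AttachRequest`, `AttachError`, `build_attach_request`, and `treat_existing_as_success` are hypothetical names for illustration; only the attachment name pattern, the operator, the three param keys, the 100-record threshold, and the two error names come from this description.

```rust
use std::collections::BTreeMap;

/// Hypothetical error type; the first two variant names come from the PR description.
#[derive(Debug, PartialEq)]
enum AttachError {
    AlreadyExists,
    CollectionAlreadyHasFunction,
    Other(String),
}

/// Hypothetical request shape for attaching a function to a source collection.
#[derive(Debug)]
struct AttachRequest {
    name: String,
    operator: String,
    params: BTreeMap<String, String>,
    min_records_for_invocation: u64,
}

/// Builds the attachment for one source collection.
fn build_attach_request(source: &str, endpoint_url: &str) -> AttachRequest {
    let mut params = BTreeMap::new();
    params.insert("endpoint_url".to_string(), endpoint_url.to_string());
    // Both fields carry the source name, as stated in the list above.
    params.insert("source_collection".to_string(), source.to_string());
    params.insert("source_kind".to_string(), source.to_string());

    AttachRequest {
        name: format!("{source}_to_wiki"),
        operator: "http_generate".to_string(),
        params,
        min_records_for_invocation: 100,
    }
}

/// Maps the two "already attached" errors to success; other errors pass through.
fn treat_existing_as_success(result: Result<(), AttachError>) -> Result<(), AttachError> {
    match result {
        Err(AttachError::AlreadyExists) | Err(AttachError::CollectionAlreadyHasFunction) => Ok(()),
        other => other,
    }
}
```

Folding the two errors into `Ok(())` at one place keeps the retry behaviour of /init easy to audit.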

Shared constant

The chunk-sibling flag key is promoted to chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY so the reader (worker PartitionOperator) and writer (/init) share one definition; partition_log.rs re-exports it.
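
As a sketch of the layout, assuming everything lives in one file for illustration: the modules below only mimic the crate boundaries, and `groups_chunk_siblings` and `source_collection_metadata` are hypothetical helpers, not functions from the PR.

```rust
/// Stand-in for the `chroma_types` crate: the single definition of the key.
mod chroma_types {
    pub const CHROMA_GROUP_CHUNK_SIBLINGS_KEY: &str = "chroma:group_chunk_siblings";
}

/// Stand-in for the worker's partition_log.rs: re-exports the key and reads it.
mod partition_log {
    pub use super::chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY;

    use std::collections::HashMap;

    /// Hypothetical reader: true only when the flag is present and set.
    pub fn groups_chunk_siblings(metadata: &HashMap<String, bool>) -> bool {
        metadata
            .get(CHROMA_GROUP_CHUNK_SIBLINGS_KEY)
            .copied()
            .unwrap_or(false)
    }
}

/// Stand-in for the /init route: writes the same key.
mod init {
    use super::chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY;
    use std::collections::HashMap;

    /// Hypothetical writer: metadata for a newly created source collection.
    pub fn source_collection_metadata() -> HashMap<String, bool> {
        HashMap::from([(CHROMA_GROUP_CHUNK_SIBLINGS_KEY.to_string(), true)])
    }
}
```

With one definition, a rename of the key cannot leave the reader and writer out of sync.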

Wiki collections deliberately untouched

Wiki/wiki_revisions are the function's output — no chunk-sibling flag, no attach. The marker mechanism operates on the source/input side.

Caveat: get-or-create idempotency

/init uses get-or-create for collections. If a source collection already exists without the flag (e.g. created by an earlier upload), /init returns it as-is and its metadata isn't retroactively updated, so /init must run before the first upload (it's the bootstrap). The function attach is independently idempotent. Pre-existing source collections would need a one-off metadata backfill, which is out of scope here.
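
The caveat can be reproduced with a small in-memory model. `Collections`, `get_or_create`, and `flag` are hypothetical names, and the boolean-valued metadata map is a simplification.

```rust
use std::collections::HashMap;

type Metadata = HashMap<String, bool>;

/// Hypothetical in-memory store used only to illustrate get-or-create semantics.
#[derive(Default)]
struct Collections(HashMap<String, Metadata>);

impl Collections {
    /// Returns the stored metadata; `metadata` is applied only when the
    /// collection does not exist yet.
    fn get_or_create(&mut self, name: &str, metadata: Metadata) -> &Metadata {
        self.0.entry(name.to_string()).or_insert(metadata)
    }
}

/// Metadata that /init would pass for a source collection.
fn flag() -> Metadata {
    HashMap::from([("chroma:group_chunk_siblings".to_string(), true)])
}
```

If an upload has already created `slack` with empty metadata, a later `get_or_create("slack", flag())` returns the empty metadata unchanged, whereas a collection first created by /init carries the flag.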

Test plan

  • cargo check -p foundation-api, cargo check -p chroma-types pass.
  • cargo test -p foundation-api --lib routes::init — 4/4 pass.
  • CI must run the worker suite — the partition_log.rs re-export couldn't be compiled locally (Homebrew rustc 1.94.1 vs pinned 1.92.0; wal3 fails under 1.94.1 independent of this change).
  • End-to-end (after #7133 merges): /init → source collections carry the flag and have http_generate attached → uploads chunk into them → attached function runs and observes the end-of-job marker last.

🤖 Generated with Claude Code

@github-actions

Reviewer Checklist

Please use this checklist to ensure your code review is thorough before approving.

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

LLay changed the title from "[ENH] Set chunk-sibling grouping flag on foundation source collections" to "[ENH] /init: create source collections + attach foundation function" on May 26, 2026

Comment thread on rust/foundation-api/src/config.rs (outdated)
    "http_generate".to_string()
}
fn default_function_endpoint_url() -> String {
    "https://chroma-core--foundation-research-generate-api.modal.run".to_string()
Contributor

Is this static across environments including tilt?

Contributor Author

Oh good question. @HammadB can you advise?

Collaborator

it will change eventually, this should not be in the code IMO and should default to an error

@LLay LLay marked this pull request as ready for review May 27, 2026 00:11
@LLay LLay requested a review from HammadB May 27, 2026 00:16
HammadB (Collaborator) commented May 27, 2026

Please address - #7134 (comment)

LLay and others added 2 commits May 27, 2026 09:02
Foundation's /init endpoint now also ensures the source collections
(slack, notion — configurable via CHROMA_FOUNDATION__SOURCE_COLLECTIONS)
and sets the `chroma:group_chunk_siblings` metadata flag on each. That
flag opts the collection into chunk-sibling grouping in the worker's
PartitionOperator, so a job's chunk records stay in one partition and
the trailing end-of-job marker on `{base}-0` is observed after every
sibling chunk (ADR 0001 §6 in chroma-core/foundation).
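
A loose sketch of what grouping by chunk base could look like. The actual partitioning lives in #7133 and is not reproduced here; the `{base}-{index}` id shape is inferred from the `{base}-0` marker mentioned above, and `chunk_base` and `group_chunk_siblings` are hypothetical helpers.

```rust
use std::collections::BTreeMap;

/// Returns the base of a record id of the assumed form `{base}-{index}`.
/// Ids without a numeric suffix are treated as their own base.
fn chunk_base(id: &str) -> &str {
    match id.rsplit_once('-') {
        Some((base, index))
            if !index.is_empty() && index.bytes().all(|b| b.is_ascii_digit()) =>
        {
            base
        }
        _ => id,
    }
}

/// Groups record ids so that all siblings of one base land in the same bucket,
/// preserving the order in which they appeared in the input.
fn group_chunk_siblings<'a>(ids: &[&'a str]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut groups: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for &id in ids {
        groups.entry(chunk_base(id)).or_default().push(id);
    }
    groups
}
```

Keeping every sibling in one bucket is what lets a consumer see the marker record together with the chunks it closes.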

- Promote the flag key to a shared constant
  `chroma_types::CHROMA_GROUP_CHUNK_SIBLINGS_KEY` (next to the existing
  CHROMA_* metadata keys) so the reader (worker) and writer
  (foundation-api) share one definition. partition_log.rs now re-exports
  it.
- foundation-api: new `source_collections` config (default
  ["slack","notion"]); `ensure_collection` takes optional metadata;
  /init creates source collections with the flag and returns their ids.
  Wiki collections are the function's *output* and intentionally do NOT
  get the flag.

Stacked on the partition-operator change (chroma #7133), which reads
this flag. DRAFT — see PR body for the get-or-create idempotency
caveat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the CLI POC (chroma-core/foundation #97): after ensuring each
source collection, /init attaches the server-side function via
SysDb::create_attached_function, with the wiki collection as output.

- Attachment name `{source}_to_wiki`; operator `http_generate`
  (configurable); params carry the modal `endpoint_url`,
  `source_collection`, and `source_kind`.
- New FoundationConfig fields: function_name, function_endpoint_url,
  min_records_for_invocation (defaults mirror the POC + the chroma
  frontend's 100-record default). Output dimension is already hardcoded
  to 1024 in /init (chroma #7127), so no seed_output_collection step is
  needed.
- Idempotent: AlreadyExists / CollectionAlreadyHasFunction are treated
  as success so /init stays safe to call repeatedly.

LLay force-pushed the foundation_init_source_collections branch from a540171 to a78a010 on May 27, 2026 16:03

Drop default_function_endpoint_url and its hardcoded modal.run default.
function_endpoint_url is now Option<String> defaulting to None, and
/init errors (MissingFunctionEndpointUrl -> 500) when the attached
function needs it but the deploy left it unset — so a misconfigured
deployment fails loudly instead of silently pointing at a baked-in
POC endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
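
A minimal sketch of the fail-loudly behaviour this commit describes. The struct is reduced to the one relevant field, and `resolve_endpoint_url` and `status_code` are hypothetical helpers; only `function_endpoint_url: Option<String>`, `MissingFunctionEndpointUrl`, and the 500 status come from the commit message.

```rust
/// Error surfaced by /init; the variant name comes from the commit message.
#[derive(Debug, PartialEq)]
enum InitError {
    MissingFunctionEndpointUrl,
}

impl InitError {
    /// Hypothetical mapping to an HTTP status code.
    fn status_code(&self) -> u16 {
        match self {
            InitError::MissingFunctionEndpointUrl => 500,
        }
    }
}

/// Reduced config: only the field relevant to this commit.
#[derive(Default)]
struct FoundationConfig {
    /// No baked-in default; the deployment must set it.
    function_endpoint_url: Option<String>,
}

/// Returns the configured URL, or an error when the deploy left it unset.
fn resolve_endpoint_url(config: &FoundationConfig) -> Result<&str, InitError> {
    config
        .function_endpoint_url
        .as_deref()
        .ok_or(InitError::MissingFunctionEndpointUrl)
}
```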