Enable live catalog with region-aware fallback engine#789
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Pull request overview
This PR replaces the C++ SDK’s embedded/static model catalog with a live Azure Foundry catalog client, adds automatic region detection from azureml-served-by-cluster, and introduces a region-fallback engine used by both catalog queries and model-registry SAS resolution so downloads target the region that served the catalog entry.
Changes:
- Removed the static snapshot catalog + snapshot generation tool, and always build/use the live Azure catalog client.
- Added
RegionFallback+ status-aware HTTP response plumbing to support regional retries and sticky-region behavior. - Stamped models with
detected_regionand used it to choose the model registry region when resolving download SAS URIs (with configurable fallback disable).
Reviewed changes
Copilot reviewed 32 out of 33 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk_v2/cpp/tools/catalog_snapshot/main.cc | Removed the snapshot generation tool entrypoint (static catalog pipeline retired). |
| sdk_v2/cpp/tools/catalog_snapshot/CMakeLists.txt | Removed CMake target for the snapshot tool. |
| sdk_v2/cpp/test/sdk_api/shared_test_env.h | Test env now always points at the live Azure catalog URL (no static path). |
| sdk_v2/cpp/test/internal_api/static_catalog_test.cc | Removed tests that exercised the embedded static catalog client. |
| sdk_v2/cpp/test/internal_api/region_fallback_test.cc | Added unit tests for region fallback chain building, retryability, sticky behavior, and exhaustion. |
| sdk_v2/cpp/test/internal_api/model_info_test.cc | Added JSON round-trip coverage for ModelInfo.detected_region. |
| sdk_v2/cpp/test/internal_api/download_test.cc | Updated registry client tests for status-aware HTTP + fallback; added download region-resolution tests. |
| sdk_v2/cpp/test/internal_api/catalog_cache_test.cc | Added cache save/load coverage for detected_region. |
| sdk_v2/cpp/test/internal_api/c_api_test.cc | Relaxed catalog population assertion to avoid network-dependent expectations. |
| sdk_v2/cpp/test/internal_api/azure_catalog_test.cc | Expanded live catalog client tests to cover region detection, URL rewriting, stamping, and fallback behavior. |
| sdk_v2/cpp/test/CMakeLists.txt | Updated unit test sources: added region_fallback_test, removed static_catalog_test, always includes azure_catalog_test. |
| sdk_v2/cpp/src/util/region_fallback.h | New public/internal utility interface for region-aware retry/fallback execution. |
| sdk_v2/cpp/src/util/region_fallback.cc | New implementation with proximal-region mapping, retryable classification, sticky region tracking, and diagnostics. |
| sdk_v2/cpp/src/model_info.h | Added detected_region field to model metadata. |
| sdk_v2/cpp/src/model_info.cc | Implemented JSON serialization/deserialization for detected_region (omitted when empty). |
| sdk_v2/cpp/src/manager.cc | Defaulted catalog region to auto and plumbed DisableRegionFallback into download + catalog components. |
| sdk_v2/cpp/src/http/http_client.h | Introduced HttpResponse and non-throwing Http*WithResponse APIs (status + headers + body). |
| sdk_v2/cpp/src/http/http_client.cc | Implemented response-aware GET/POST (including header capture + transport failure as status==0). |
| sdk_v2/cpp/src/download/model_registry_client.h | Updated registry client API to accept per-call region and use response-aware HTTP + fallback. |
| sdk_v2/cpp/src/download/model_registry_client.cc | Implemented region-fallback-driven registry resolution and parsing of response bodies. |
| sdk_v2/cpp/src/download/download_manager.h | Added config-region tracking and disable_region_fallback plumbing. |
| sdk_v2/cpp/src/download/download_manager.cc | Resolved registry region via explicit config → detected region → eastus; wired fallback-enabled registry client. |
| sdk_v2/cpp/src/catalog/static_catalog_client.cc | Removed embedded snapshot-backed catalog implementation. |
| sdk_v2/cpp/src/catalog/live_catalog_client_stub.cc | Removed stub that previously threw when live client sources were absent. |
| sdk_v2/cpp/src/catalog/catalog_snapshot_data.h | Removed snapshot data header (embedded bytes no longer part of build). |
| sdk_v2/cpp/src/catalog/catalog_client.h | Updated contract/docs: MakeCatalogClient now always constructs live Azure client and accepts region/fallback controls. |
| sdk_v2/cpp/src/catalog/catalog_client.cc | Retained cached-model + BYO synthesis helper; removed static-vs-live dispatch logic. |
| sdk_v2/cpp/src/catalog/azure_model_catalog.h | Default catalog URL is now always Azure; added region + fallback configuration fields. |
| sdk_v2/cpp/src/catalog/azure_model_catalog.cc | Plumbed region/fallback into MakeCatalogClient and persisted fetched model infos (incl. region) to cache. |
| sdk_v2/cpp/src/catalog/azure_catalog_client.h | New live Azure catalog client interface with region detection and fallback-aware fetching. |
| sdk_v2/cpp/src/catalog/azure_catalog_client.cc | New live Azure catalog implementation: request shaping, region detection, URL rewriting, filter-set atomicity, and fallback. |
| sdk_v2/cpp/CMakeLists.txt | Removed tools option + static snapshot wiring; always builds live catalog + region fallback sources. |
636f28d to
a827ec8
Compare
| const auto fallback_it = config_.additional_options.find("DisableRegionFallback"); | ||
| if (fallback_it != config_.additional_options.cend()) { | ||
| const auto value = to_lower(fallback_it->second); | ||
| disable_region_fallback = (value == "true" || value == "1" || value == "yes"); |
There was a problem hiding this comment.
Not necessary to change in this PR, but no other parameter that gets parsed from the config accepts "yes". I think it's a good idea to keep here, but maybe we should standardize what truthy-values get accepted with a helper function somewhere
| const std::string& json_body, | ||
| const std::string& user_agent, | ||
| std::chrono::milliseconds timeout, | ||
| bool close_connection) { |
There was a problem hiding this comment.
Nit: Maybe consider changing the order of close_connection and timeout here and in HttpGetWithResponse to align more with HttpRequestRaw
Or, maybe there should be an HttpRequestOptions struct that gets passed in to store user_agent, timeout, and close_connection? Maybe it's fine as is if we don't anticipate any more arguments being added in the future.
| /// `status == 0` becomes "transport failure"; otherwise "HTTP <status>". When the body is | ||
| /// non-empty it is appended (truncated to `max_body_chars`) so server-side or transport | ||
| /// diagnostics are preserved without bloating logs. | ||
| std::string DescribeFailure(const HttpResponse& response, std::size_t max_body_chars = 512); |
There was a problem hiding this comment.
Makes sense for consumers; I will need to log ErrorEvent in here when telemetry is added instead of logging later and missing out on some information
Replaces the static, file-based model catalog with a live Azure Foundry catalog client, and makes both catalog queries and model downloads region-aware. The client now detects the caller's Azure region from cluster headers, queries that region's catalog endpoint, and transparently retries against nearby regions when an endpoint is unhealthy. Downloads resolve their SAS URI from the same region that served the model's catalog entry.
This ports three behaviors from the C# core into the C++ SDK:
azureml-served-by-clusterheaders (instead of always usingeastus).How it works
catalog_region:""/"auto"→ detect from headers; any other value → explicit override.DisableRegionFallback: disables cross-region retries (single-region behavior).