fix: scope retry health by model and region by steebchen · Pull Request #1977 · theopenco/llmgateway

steebchen · 2026-04-06T18:08:29Z

What changed

This PR scopes routing health and retry decisions to the unit they actually describe (model + region + key) and recovers from a class of provider failures that previously slipped past retry.

Retry on invalid-key 400 payloads. Some providers surface invalid credentials with status 400 and a body like API key not valid. These are now classified as auth failures, so same-provider retry rotates to the next configured key and the bad credential is excluded for the rest of the request. Centralized in a new provider-auth-errors.ts.
Per-model API key health. Tracked-key health and env-based round-robin penalties are now scoped per (key, model) instead of per key. A bad key on gpt-4 no longer fences claude-3-5-sonnet traffic on the same key.
Region-consistent direct-provider key selection. The two early region-lock lookups in chat.ts now go through the same model-scoped findProviderKey(...) path used at request time, so usedRegion ??= cannot pin one key's region to another key's token after failover.
Same-provider, same-region retry aggregation. Worker mapping aggregation now suppresses a failed attempt from model_provider_mapping_history only when the retry recovers on the same provider AND same region. A singapore failure recovered by cn-beijing is no longer dropped from singapore's mapping uptime.
Spec hardening. Mock servers wait until they're listening before tests issue requests; touched gateway specs are less flaky under local and CI timing variance.

Impact

Explicit-provider requests recover from invalid-key payloads by retrying another configured key for the same provider.
Key selection and uptime penalties are isolated per model.
Region-aware routing stays consistent with the scoped key actually selected after failover.
Provider mapping uptime no longer hides failed regional attempts when the recovery happens in a different region.

Main areas changed

apps/gateway/src/chat/chat.ts
apps/gateway/src/chat/tools/get-finish-reason-from-error.ts
apps/gateway/src/chat/tools/get-provider-env.ts
apps/gateway/src/chat/tools/resolve-provider-context.ts
apps/gateway/src/chat/tools/retry-with-fallback.ts
apps/gateway/src/lib/api-key-health.ts
apps/gateway/src/lib/cached-queries.ts
apps/gateway/src/lib/provider-auth-errors.ts (new)
apps/gateway/src/lib/round-robin-env.ts
apps/worker/src/services/stats-calculator.ts
Gateway and worker specs covering invalid-key retries, model-scoped key health, region-scoped direct-provider routing, and retry aggregation

Diff against origin/main: ~854 insertions / ~79 deletions across 21 files.

Validation

pnpm vitest run apps/gateway/src/fallback.spec.ts apps/gateway/src/chat/tools/retry-with-fallback.spec.ts apps/worker/src/services/stats-calculator.spec.ts --no-file-parallelism
pnpm format, pnpm lint, pnpm build
pnpm test:unit — all PR-related tests pass

Summary by CodeRabbit

New Features
- Key-health and token selection can be scoped by model, isolating selection per model/region.
- New credential-detection and retry helper to better identify invalid provider credentials.
Bug Fixes
- Improved alternate-key retry logic for authentication and credential-error cases.
- More reliable regional failover and routing with model-scoped key health and logging.
Tests
- Expanded unit/integration tests for key-health, failover, retry, routing, and shutdown behavior.

coderabbitai · 2026-04-06T18:08:37Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Model-scoped API-key health/selection added via a new optional selectionScope (sourced from baseModelName); credential-error detection and shouldRetryAlternateKey were introduced; key-selection, round-robin, and health APIs were updated and tests/mocks adapted to exercise scoped behavior and retry rules.

Changes

Cohort / File(s)	Summary
Credential Error Detection `apps/gateway/src/lib/provider-auth-errors.ts`	Add regex patterns and exported `hasInvalidProviderCredentialError(errorText?: string): boolean`.
Error Classification `apps/gateway/src/chat/tools/get-finish-reason-from-error.ts`, `apps/gateway/src/chat/tools/get-finish-reason-from-error.spec.ts`	Treat invalid-provider-credential messages as `gateway_error` (in addition to 401/403); add test for 400+invalid-key payload.
Key Health API & Logic `apps/gateway/src/lib/api-key-health.ts`, `apps/gateway/src/lib/api-key-health.spec.ts`	Replace inline message checks with `hasInvalidProviderCredentialError`; append optional `selectionScope` to health keys; add `selectionScope?: string` to public health APIs; add test validating permanent blacklist on invalid-key text.
Round-Robin Selection `apps/gateway/src/lib/round-robin-env.ts`, `apps/gateway/src/lib/round-robin-env.spec.ts`	Add optional `selectionScope` to `getRoundRobinValue`/`peekRoundRobinValue` and forward into health lookups; add model-scoped key-health test.
Cached Queries / Provider Lookup `apps/gateway/src/lib/cached-queries.ts`, `apps/gateway/src/lib/cached-queries.spec.ts`	Rename optional `_selectionKey`→`selectionScope` in `findProviderKey`/`findCustomProviderKey`; thread `selectionScope` into `selectProviderKeyWithFailover` and tests.
Get Provider Env `apps/gateway/src/chat/tools/get-provider-env.ts`, `apps/gateway/src/chat/tools/get-provider-env.spec.ts`	Add `selectionScope` to `GetProviderEnvOptions`; forward scope into round-robin selection; test verifies different scopes yield different indices.
Retry / Fallback Logic `apps/gateway/src/chat/tools/retry-with-fallback.ts`, `apps/gateway/src/chat/tools/retry-with-fallback.spec.ts`	Add exported `shouldRetryAlternateKey(errorType,statusCode?,errorText?)` to allow alternate-key retries for retryable errors, 401/403 gateway errors, or invalid-credential messages; add tests.
Provider Context & Chat Flow `apps/gateway/src/chat/tools/resolve-provider-context.ts`, `apps/gateway/src/chat/chat.ts`	Use `baseModelName` as `selectionScope` for provider-key lookups; pass `selectionScope` into `getProviderEnv`; thread `baseModelName` into key-health reporting and alternate-key tracking.
API / Integration Tests & Mocks `apps/gateway/src/api.spec.ts`, `apps/gateway/src/fallback.spec.ts`, `apps/gateway/src/graceful-shutdown.spec.ts`, `apps/gateway/src/test-utils/mock-openai-server.ts`	Update SSE test to check `selectionScope`-scoped metrics; add auth/invalid-key retry tests; add `waitForServerListening` helper; mock server supports `TRIGGER_FAIL_ONCE_INVALID_KEY`.
Worker Stats Aggregation `apps/worker/src/services/stats-calculator.ts`, `apps/worker/src/services/stats-calculator.spec.ts`	Add SQL predicates to exclude same-provider recovered retries (and region-aware variant) from minute aggregations; add tests validating inclusion/exclusion rules.
Misc tests / plumbing `apps/gateway/src/chat/tools/get-provider-env.spec.ts`, `apps/gateway/src/chat/tools/retry-with-fallback.spec.ts`, `apps/gateway/src/lib/cached-queries.spec.ts`, `apps/gateway/src/lib/round-robin-env.spec.ts`, `apps/gateway/src/lib/api-key-health.spec.ts`, `apps/gateway/src/chat/chat.ts`	Add/adjust unit and integration tests and update chat flow call sites to propagate `selectionScope` and updated retry/health APIs.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ChatHandler as Chat Handler
    participant KeySelector as Provider Key Selector
    participant HealthTracker as Health Tracker
    participant Provider as Upstream Provider

    Client->>ChatHandler: Request (includes baseModelName)
    ChatHandler->>KeySelector: findProviderKey(org, provider, selectionScope: baseModelName)
    KeySelector->>HealthTracker: isTrackedKeyHealthy(keyId, selectionScope: baseModelName)
    HealthTracker-->>KeySelector: healthy? (scoped)
    KeySelector-->>ChatHandler: primaryKey

    ChatHandler->>Provider: Send request with primaryKey
    Provider-->>ChatHandler: Error (401/403 or 400+invalid-key or other)

    ChatHandler->>ChatHandler: shouldRetryAlternateKey(errorType, statusCode, errorText)?
    alt Retry Eligible
        ChatHandler->>KeySelector: findProviderKey(..., selectionScope: baseModelName, excludedKeyIds: [primaryKey])
        KeySelector->>HealthTracker: isTrackedKeyHealthy(alternateKeyId, selectionScope: baseModelName)
        HealthTracker-->>KeySelector: healthy? (scoped)
        KeySelector-->>ChatHandler: alternateKey

        ChatHandler->>HealthTracker: reportTrackedKeyError(primaryKeyId, statusCode, errorText, selectionScope: baseModelName)
        ChatHandler->>Provider: Retry with alternateKey
        Provider-->>ChatHandler: Success
        ChatHandler->>HealthTracker: reportTrackedKeySuccess(alternateKeyId, selectionScope: baseModelName)
    else Not Eligible
        ChatHandler->>HealthTracker: reportTrackedKeyError(primaryKeyId, statusCode, errorText, selectionScope: baseModelName)
        ChatHandler-->>Client: Return error
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~90 minutes

Possibly related PRs

feat(gateway): improve retry routing and request logging #1973 — Overlapping changes to provider key health, selectionScope plumbing, and key-selection APIs; strongly related at code level.
feat(azure): add gpt-image-2 via Azure OpenAI #2102 — Related change to finish-reason/error-classification logic (both modify get-finish-reason-from-error behavior).

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 51.52% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: scope retry health by model and region' clearly and concisely summarizes the main change: implementing model and region-based scoping for retry health tracking across the codebase.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/retry-invalid-api-key-payloads

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…-key-payloads

Copilot

Pull request overview

This PR refines retry/health behavior across the gateway and worker so that (1) invalid-credential failures are detected even when upstream responds with 400, (2) API key health is tracked per-model (not globally per key), and (3) provider routing health stats aren’t degraded by same-provider retries that ultimately recover.

Changes:

Add invalid-provider-credential text detection and use it in finish-reason classification, retry decisions, and key blacklisting.
Scope key health tracking/selection by model via a selectionScope plumbed through env-key and DB-key selection paths.
Exclude same-provider recovered retry attempts from worker-generated routing health statistics, with supporting tests.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
apps/worker/src/services/stats-calculator.ts	Filters out recovered same-provider retry attempts from minute-level stats aggregation.
apps/worker/src/services/stats-calculator.spec.ts	Adds coverage ensuring recovered same-provider retries don’t degrade mapping/model health stats.
apps/gateway/src/test-utils/mock-openai-server.ts	Adds a mock scenario for invalid-key payloads returned as `400` on first attempt.
apps/gateway/src/lib/round-robin-env.ts	Threads `selectionScope` into env key selection so health is isolated by scope.
apps/gateway/src/lib/round-robin-env.spec.ts	Tests env-key health isolation per model scope.
apps/gateway/src/lib/provider-auth-errors.ts	New helper for detecting invalid credential error payload patterns.
apps/gateway/src/lib/cached-queries.ts	Threads `selectionScope` into DB provider-key selection and health checks.
apps/gateway/src/lib/cached-queries.spec.ts	Tests tracked-key health isolation per model scope.
apps/gateway/src/lib/api-key-health.ts	Adds model-scoped health keys and uses invalid-credential text detection for permanent blacklisting.
apps/gateway/src/lib/api-key-health.spec.ts	Adds a test for permanently blacklisting invalid-key text on an otherwise-ignored `4xx`.
apps/gateway/src/graceful-shutdown.spec.ts	Stabilizes tests by waiting for the server to be listening before making requests.
apps/gateway/src/fallback.spec.ts	Adds fallback tests for same-provider key rotation on auth failures and invalid-key payloads; resets key health between tests.
apps/gateway/src/chat/tools/retry-with-fallback.ts	Introduces `shouldRetryAlternateKey` to allow same-provider key rotation for auth/invalid-key payloads.
apps/gateway/src/chat/tools/retry-with-fallback.spec.ts	Adds unit tests for `shouldRetryAlternateKey`.
apps/gateway/src/chat/tools/resolve-provider-context.ts	Passes base model as `selectionScope` into env/key selection.
apps/gateway/src/chat/tools/get-provider-env.ts	Accepts `selectionScope` and forwards it into round-robin env key selection.
apps/gateway/src/chat/tools/get-provider-env.spec.ts	Tests that `selectionScope` affects env key health/selection.
apps/gateway/src/chat/tools/get-finish-reason-from-error.ts	Treats invalid-key payload text as credential failure (gateway_error) even for `400`.
apps/gateway/src/chat/tools/get-finish-reason-from-error.spec.ts	Adds coverage for invalid-key payloads on `400`.
apps/gateway/src/chat/chat.ts	Plumbs model scope into key health reporting and uses `shouldRetryAlternateKey` for same-provider retries.
apps/gateway/src/api.spec.ts	Updates tracked-key health assertions to use scoped health keys.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-07T19:20:52Z

@@ -184,7 +185,7 @@ export async function findOrganizationById(
 export async function findCustomProviderKey(
 	organizationId: string,
 	customProviderName: string,
-	_selectionKey?: string,
+	selectionScope?: string,
 	excludedKeyIds?: ReadonlySet<string>,
 ): Promise<ProviderKey | undefined> {
 	const results = await db
@@ -199,7 +200,7 @@ export async function findCustomProviderKey(
 			),
 		)
 		.orderBy(asc(providerKeyTable.createdAt), asc(providerKeyTable.id));
-	return selectProviderKeyWithFailover(results, excludedKeyIds);
+	return selectProviderKeyWithFailover(results, selectionScope, excludedKeyIds);
 }

 /**
@@ -208,7 +209,7 @@ export async function findCustomProviderKey(
 export async function findProviderKey(
 	organizationId: string,
 	provider: string,
-	_selectionKey?: string,
+	selectionScope?: string,
 	excludedKeyIds?: ReadonlySet<string>,
 ): Promise<ProviderKey | undefined> {
 	const results = await db
@@ -222,7 +223,7 @@ export async function findProviderKey(
 			),
 		)
 		.orderBy(asc(providerKeyTable.createdAt), asc(providerKeyTable.id));
-	return selectProviderKeyWithFailover(results, excludedKeyIds);
+	return selectProviderKeyWithFailover(results, selectionScope, excludedKeyIds);


findProviderKey / findCustomProviderKey now treat the 3rd argument as selectionScope (used to scope in-memory key health buckets). There are still call sites in the repo passing requestId/selectionKey as the 3rd argument (e.g. apps/gateway/src/moderations/moderations.ts and apps/gateway/src/videos/videos.ts), which will accidentally scope health per-request and can cause unbounded keyHealthMap growth + inconsistent failover behavior. Please update those call sites to pass a stable scope (e.g. base model name) or undefined, and if request-based cache-busting is still needed, reintroduce it as a separate parameter rather than overloading selectionScope.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b45c134a5d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-07T19:21:08Z

+			metrics: getTrackedKeyMetrics(item.id, selectionScope),
 		}))
-		.filter(({ item }) => isTrackedKeyHealthy(item.id));
+		.filter(({ item }) => isTrackedKeyHealthy(item.id, selectionScope));


Restore backward-compatible selection-key semantics

This change makes selectionScope participate in key-health lookups (getTrackedKeyMetrics / isTrackedKeyHealthy), but the third argument on findProviderKey/findCustomProviderKey was previously an ignored _selectionKey, and some callers still pass per-request IDs (for example apps/gateway/src/moderations/moderations.ts around lines 380-401 and apps/gateway/src/videos/videos.ts around line 2164). Because those values are unique per request while health reporting there remains unscoped, key selection effectively sees a fresh healthy bucket every time and can keep choosing bad keys instead of failing over.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

apps/gateway/src/chat/chat.ts (2)
2694-2724: ⚠️ Potential issue | 🟠 Major

Keep the tracked key identity in sync when a regional env token replaces the selected credential.

Lines 2719-2724, 2768-2773, 2794-2799, and 2845-2850 can swap usedToken to a region-specific env value, but providerKey, envVarName, and configIndex still describe the original DB/base key. The later rememberFailedKey / reportKey* calls will quarantine or heal the wrong credential, and same-provider retry can keep reusing the same bad regional token while rotating unrelated keys.

Also applies to: 2761-2774, 2778-2800, 2838-2851
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/chat.ts` around lines 2694 - 2724, The region-specific
env token replacement swaps usedToken but leaves
providerKey/envVarName/configIndex pointing at the original DB key, causing
wrong keys to be reported or quarantined; update the logic in the blocks that
call getRegionSpecificEnvValue (around where usedRegion is set after
resolveRegionFromProviderKey) to also update the tracking identity: when
regionToken replaces usedToken, set providerKey.token (or create a new
providerKey-like object) and update envVarName and configIndex to reflect the
env-derived credential so subsequent calls to rememberFailedKey and reportKey*
operate on the actual credential in use; ensure the same change is applied to
all similar blocks (the ones you noted at 2719-2724, 2768-2773, 2794-2799,
2845-2850 and the other mirrored ranges).
4846-4860: ⚠️ Potential issue | 🟠 Major

Do not let recovered same-provider retries mark the provider as failed.

These branches now record failed attempts for alternate-key retries, but the later failedMap / providerScores enrichment still keys failures only by provider at Lines 5335-5351 and 8502-8516. If key A fails and key B succeeds on the same provider, the provider still ends up marked failed, so derived provider uptime is still penalized.
Possible follow-up
-const failedMap = new Map(
-	routingAttempts
-		.filter((a) => !a.succeeded)
-		.map((f) => [f.provider, f]),
-);
+const recoveredProviders = new Set(
+	routingAttempts.filter((a) => a.succeeded).map((a) => a.provider),
+);
+const failedMap = new Map(
+	routingAttempts
+		.filter((a) => !a.succeeded && !recoveredProviders.has(a.provider))
+		.map((f) => [f.provider, f]),
+);
Also applies to: 5123-5137, 8231-8241
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/chat.ts` around lines 4846 - 4860, The current logic
calls rememberFailedKey during same-provider alternate-key retries which causes
the provider to be treated as failed even if another key on the same provider
later succeeds; update the flow so that when shouldRetryAlternateKey(...)
triggers and you call tryResolveAlternateKeyForCurrentProvider(true) you do NOT
mark the provider as failed up-front — instead only record the specific key
failure (or pass a flag to rememberFailedKey to avoid marking provider-level
failure), and only add entries to failedMap/providerScores (the provider-level
failure enrichment used later) when all keys for that provider have actually
failed or when there is no successful alternate key (i.e., check the result of
tryResolveAlternateKeyForCurrentProvider and only escalate to provider-level
failure if it returns falsy). Ensure changes touch the branches around
shouldRetryAlternateKey, rememberFailedKey,
tryResolveAlternateKeyForCurrentProvider and the later failedMap/providerScores
enrichment so provider uptime isn’t penalized when an alternate key recovers the
request.

🧹 Nitpick comments (1)

apps/gateway/src/lib/api-key-health.ts (1)
47-68: Consider monitoring memory growth with model-scoped health tracking.

With model-scoped health keys, keyHealthMap entries grow as O(keys × models) rather than O(keys). While MAX_HISTORY_SIZE and METRICS_WINDOW_MS bound per-entry memory, there's no mechanism to prune stale entries for models no longer in use.

For most deployments this is acceptable, but consider:

Monitoring getTrackedKeyCount() in production

Adding TTL-based eviction for entries with no recent activity if memory becomes a concern
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/lib/api-key-health.ts` around lines 47 - 68, The
keyHealthMap can grow unbounded when using model-scoped keys; add a TTL-based
eviction to prune entries with no recent activity: augment KeyHealth entries
with a lastActivity timestamp updated in recordError/recordSuccess, expose
getTrackedKeyCount() for monitoring, and implement a periodic cleanup task
(interval based on METRICS_WINDOW_MS or a new CLEANUP_INTERVAL_MS) that removes
map entries whose lastActivity is older than a configurable TTL (e.g.,
METRICS_WINDOW_MS or a new KEY_TTL_MS) while preserving MAX_HISTORY_SIZE
semantics; reference keyHealthMap, KeyHealth, recordError/recordSuccess,
getTrackedKeyCount, MAX_HISTORY_SIZE and METRICS_WINDOW_MS to locate where to
update and where to add the cleanup timer.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/gateway/src/chat/chat.ts`:
- Around line 2694-2704: Calls to findCustomProviderKey and findProviderKey are
missing the new selectionScope argument in several places; update every
invocation (including the earlier call near the top and the ones that correspond
to the referenced blocks) to pass the current selectionScope value so the key
lookups are region/selection-scoped. Specifically, add selectionScope as the
third parameter when calling findCustomProviderKey(project.organizationId,
customProviderName, selectionScope) and when calling
findProviderKey(project.organizationId, usedProvider, selectionScope) (and
similarly for the other direct-provider calls noted), ensuring you use the same
selectionScope variable used elsewhere in this file so custom-provider
validation and region locking are correctly scoped.

In `@apps/gateway/src/lib/provider-auth-errors.ts`:
- Around line 1-5: The current INVALID_PROVIDER_CREDENTIAL_PATTERNS only matches
human-readable message text and misses structured error codes like
"invalid_api_key"; update the pattern array
(INVALID_PROVIDER_CREDENTIAL_PATTERNS) to include a case-insensitive regex that
matches the error code token (e.g. /\binvalid_api_key\b/i or a JSON-ish code
match) so payloads carrying code: "invalid_api_key" trigger the same
auth-failure handling, and apply the same change to the other identical pattern
array found later in the file (the second occurrence at lines 12-14).

In `@apps/gateway/src/test-utils/mock-openai-server.ts`:
- Around line 770-785: The invalid-key trigger handler (checking userMessage for
"TRIGGER_FAIL_ONCE_INVALID_KEY") is colliding with the more general
"TRIGGER_FAIL_ONCE" handler and causing failOnceCounter to be incremented twice;
make the triggers mutually exclusive by changing the control flow so only one
handler runs: either (A) convert the generic "TRIGGER_FAIL_ONCE" handler to an
else if so it won’t run when "TRIGGER_FAIL_ONCE_INVALID_KEY" matched, or (B)
make the invalid-key check more specific (e.g., exact match or
word-boundary/unique token) so
userMessage.includes("TRIGGER_FAIL_ONCE_INVALID_KEY") cannot also match the
generic "TRIGGER_FAIL_ONCE"; adjust code around failOnceCounter, userMessage,
TRIGGER_FAIL_ONCE_INVALID_KEY and TRIGGER_FAIL_ONCE accordingly.

---

Outside diff comments:
In `@apps/gateway/src/chat/chat.ts`:
- Around line 2694-2724: The region-specific env token replacement swaps
usedToken but leaves providerKey/envVarName/configIndex pointing at the original
DB key, causing wrong keys to be reported or quarantined; update the logic in
the blocks that call getRegionSpecificEnvValue (around where usedRegion is set
after resolveRegionFromProviderKey) to also update the tracking identity: when
regionToken replaces usedToken, set providerKey.token (or create a new
providerKey-like object) and update envVarName and configIndex to reflect the
env-derived credential so subsequent calls to rememberFailedKey and reportKey*
operate on the actual credential in use; ensure the same change is applied to
all similar blocks (the ones you noted at 2719-2724, 2768-2773, 2794-2799,
2845-2850 and the other mirrored ranges).
- Around line 4846-4860: The current logic calls rememberFailedKey during
same-provider alternate-key retries which causes the provider to be treated as
failed even if another key on the same provider later succeeds; update the flow
so that when shouldRetryAlternateKey(...) triggers and you call
tryResolveAlternateKeyForCurrentProvider(true) you do NOT mark the provider as
failed up-front — instead only record the specific key failure (or pass a flag
to rememberFailedKey to avoid marking provider-level failure), and only add
entries to failedMap/providerScores (the provider-level failure enrichment used
later) when all keys for that provider have actually failed or when there is no
successful alternate key (i.e., check the result of
tryResolveAlternateKeyForCurrentProvider and only escalate to provider-level
failure if it returns falsy). Ensure changes touch the branches around
shouldRetryAlternateKey, rememberFailedKey,
tryResolveAlternateKeyForCurrentProvider and the later failedMap/providerScores
enrichment so provider uptime isn’t penalized when an alternate key recovers the
request.

---

Nitpick comments:
In `@apps/gateway/src/lib/api-key-health.ts`:
- Around line 47-68: The keyHealthMap can grow unbounded when using model-scoped
keys; add a TTL-based eviction to prune entries with no recent activity: augment
KeyHealth entries with a lastActivity timestamp updated in
recordError/recordSuccess, expose getTrackedKeyCount() for monitoring, and
implement a periodic cleanup task (interval based on METRICS_WINDOW_MS or a new
CLEANUP_INTERVAL_MS) that removes map entries whose lastActivity is older than a
configurable TTL (e.g., METRICS_WINDOW_MS or a new KEY_TTL_MS) while preserving
MAX_HISTORY_SIZE semantics; reference keyHealthMap, KeyHealth,
recordError/recordSuccess, getTrackedKeyCount, MAX_HISTORY_SIZE and
METRICS_WINDOW_MS to locate where to update and where to add the cleanup timer.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fa8012b2-8a2d-4667-8256-d0c644e96a9d

📥 Commits

Reviewing files that changed from the base of the PR and between 72569f6 and b45c134.

📒 Files selected for processing (21)

apps/gateway/src/api.spec.ts
apps/gateway/src/chat/chat.ts
apps/gateway/src/chat/tools/get-finish-reason-from-error.spec.ts
apps/gateway/src/chat/tools/get-finish-reason-from-error.ts
apps/gateway/src/chat/tools/get-provider-env.spec.ts
apps/gateway/src/chat/tools/get-provider-env.ts
apps/gateway/src/chat/tools/resolve-provider-context.ts
apps/gateway/src/chat/tools/retry-with-fallback.spec.ts
apps/gateway/src/chat/tools/retry-with-fallback.ts
apps/gateway/src/fallback.spec.ts
apps/gateway/src/graceful-shutdown.spec.ts
apps/gateway/src/lib/api-key-health.spec.ts
apps/gateway/src/lib/api-key-health.ts
apps/gateway/src/lib/cached-queries.spec.ts
apps/gateway/src/lib/cached-queries.ts
apps/gateway/src/lib/provider-auth-errors.ts
apps/gateway/src/lib/round-robin-env.spec.ts
apps/gateway/src/lib/round-robin-env.ts
apps/gateway/src/test-utils/mock-openai-server.ts
apps/worker/src/services/stats-calculator.spec.ts
apps/worker/src/services/stats-calculator.ts

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

apps/gateway/src/chat/chat.ts (1)
4848-4862: ⚠️ Potential issue | 🟠 Major

Recovered same-provider retries still mark the provider as failed.

These branches record a failed routingAttempt before rotating to another key on the same provider. Later, Line 5337 and Line 8504 derive providerScores.failed from any failed attempt keyed only by provider, so a request that recovers on the second key still leaves that provider flagged failed. That keeps the derived provider health/routing stats degraded even though the provider succeeded end-to-end.

Please exclude recovered same-provider retries from provider-level failure enrichment, or mark them separately so only unrecovered provider failures affect providerScores.

Also applies to: 4990-5007, 5125-5139, 5252-5269, 8233-8243, 8377-8394
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/chat.ts` around lines 4848 - 4862, The branches that
call rememberFailedKey(usedProvider, ...) before rotating to another
same-provider key are marking the provider as failed even when
tryResolveAlternateKeyForCurrentProvider(true) recovers the request; change the
logic so recovered same-provider retries do not count as provider-level
failures: either pass a new flag (e.g., excludeProvider=true or keyOnly=true)
into rememberFailedKey when same-provider recovery succeeds, or add a
routingAttempt property (e.g., recoveredSameProvider) when recording attempts
and update the providerScores.failed derivation to ignore attempts with
recoveredSameProvider=true; apply the same change to the other listed
occurrences (around lines referenced: 4990-5007, 5125-5139, 5252-5269,
8233-8243, 8377-8394) to ensure only unrecovered provider failures affect
provider-level health.

♻️ Duplicate comments (1)

apps/gateway/src/chat/chat.ts (1)
1437-1440: ⚠️ Potential issue | 🟠 Major

Scope the early custom-provider validation lookup too.

Line 1437 still calls findCustomProviderKey(...) without the new selection scope, so this preflight can still consult global key health and reject a healthy key for the requested model. That leaves the custom-provider path inconsistent with the scoped lookups used later in actual key resolution.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/chat.ts` around lines 1437 - 1440, The preflight call
to findCustomProviderKey (assigning customProviderKey) is unscoped and can
consult global keys; change this lookup to use the same scoped selection used
later in key resolution (i.e., call the scoped variant or pass the
selection/scope parameter such as organizationId and requested model/provider)
so the early validation only checks keys valid for the project context and
model; update the invocation that sets customProviderKey (and any related
preflight checks) to use the scoped lookup API used later in the resolution flow
(keep references: findCustomProviderKey, customProviderKey, customProviderName).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@apps/gateway/src/chat/chat.ts`:
- Around line 4848-4862: The branches that call rememberFailedKey(usedProvider,
...) before rotating to another same-provider key are marking the provider as
failed even when tryResolveAlternateKeyForCurrentProvider(true) recovers the
request; change the logic so recovered same-provider retries do not count as
provider-level failures: either pass a new flag (e.g., excludeProvider=true or
keyOnly=true) into rememberFailedKey when same-provider recovery succeeds, or
add a routingAttempt property (e.g., recoveredSameProvider) when recording
attempts and update the providerScores.failed derivation to ignore attempts with
recoveredSameProvider=true; apply the same change to the other listed
occurrences (around lines referenced: 4990-5007, 5125-5139, 5252-5269,
8233-8243, 8377-8394) to ensure only unrecovered provider failures affect
provider-level health.

---

Duplicate comments:
In `@apps/gateway/src/chat/chat.ts`:
- Around line 1437-1440: The preflight call to findCustomProviderKey (assigning
customProviderKey) is unscoped and can consult global keys; change this lookup
to use the same scoped selection used later in key resolution (i.e., call the
scoped variant or pass the selection/scope parameter such as organizationId and
requested model/provider) so the early validation only checks keys valid for the
project context and model; update the invocation that sets customProviderKey
(and any related preflight checks) to use the scoped lookup API used later in
the resolution flow (keep references: findCustomProviderKey, customProviderKey,
customProviderName).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 667a06f7-2da5-42a7-a473-1ea11e7d45da

📥 Commits

Reviewing files that changed from the base of the PR and between b45c134 and 86a36d2.

📒 Files selected for processing (4)

apps/gateway/src/chat/chat.ts
apps/gateway/src/fallback.spec.ts
apps/worker/src/services/stats-calculator.spec.ts
apps/worker/src/services/stats-calculator.ts

🚧 Files skipped from review as they are similar to previous changes (3)

apps/worker/src/services/stats-calculator.ts
apps/worker/src/services/stats-calculator.spec.ts
apps/gateway/src/fallback.spec.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 86a36d2c73

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-11T11:49:39Z

+	baseKey: string,
+	selectionScope?: string,
+): string {
+	return selectionScope ? `${baseKey}:${selectionScope}` : baseKey;


Cap health-scope cardinality

Using selectionScope directly in the map key makes health buckets unbounded, and these bucket keys are never evicted (only history arrays are pruned). In this commit, chat now passes model IDs as scope; for custom providers, parse-model-input accepts arbitrary model names from requests, so callers can create effectively unlimited distinct scopes and grow keyHealthMap without bound over time. This can steadily increase gateway memory usage under high-cardinality/custom-model traffic.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-20T20:23:50Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

# Conflicts: # apps/gateway/src/lib/cached-queries.ts

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector · 2026-04-27T18:24:06Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

apps/gateway/src/chat/chat.ts (1)
2766-2770: ⚠️ Potential issue | 🟠 Major

Scope the earlier custom-provider validation too.

These lookups are fixed, but the preflight at Line 1485 still calls findCustomProviderKey without a model scope. That means a custom key marked unhealthy for one model can still reject a different model before execution reaches this path.

Also applies to: 2850-2854
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/chat/chat.ts` around lines 2766 - 2770, The preflight
validation calls to findCustomProviderKey need to include the model scope so an
unhealthy custom provider for one model doesn't block another; update the
earlier preflight call(s) to pass the baseModelName (same string/variable used
where providerKey = await findCustomProviderKey(project.organizationId,
customProviderName, baseModelName) is used) and likewise adjust the other
occurrence around the later check (the call referenced near lines 2850-2854) to
include the baseModelName argument; ensure all invocations of
findCustomProviderKey use the three-argument form (organizationId,
customProviderName, baseModelName).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/gateway/src/chat/chat.ts`:
- Around line 2833-2835: getProviderEnv is selecting a scoped env bucket
(selectionScope: baseModelName) but later logic can swap usedToken for a
region-specific env token without updating envVarName or configIndex, causing
rememberFailedKey and the reportKey* calls to mark/heal the wrong (base)
credential; update the code paths in chat.ts where usedToken is replaced by a
regional token (also around the analogous block at lines ~2910-2912) to also
update the associated envVarName and configIndex variables to point at the
region-specific env entry so rememberFailedKey and
reportKeyFailure/reportKeyHeal operate on the actual credential used (ensure you
adjust any variables that identify the env key whenever you swap or override
usedToken).

---

Duplicate comments:
In `@apps/gateway/src/chat/chat.ts`:
- Around line 2766-2770: The preflight validation calls to findCustomProviderKey
need to include the model scope so an unhealthy custom provider for one model
doesn't block another; update the earlier preflight call(s) to pass the
baseModelName (same string/variable used where providerKey = await
findCustomProviderKey(project.organizationId, customProviderName, baseModelName)
is used) and likewise adjust the other occurrence around the later check (the
call referenced near lines 2850-2854) to include the baseModelName argument;
ensure all invocations of findCustomProviderKey use the three-argument form
(organizationId, customProviderName, baseModelName).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 23f749ad-446c-4273-92e5-e6051b485d0a

📥 Commits

Reviewing files that changed from the base of the PR and between d48275a and 90b4a5f.

📒 Files selected for processing (4)

apps/gateway/src/chat/chat.ts
apps/gateway/src/chat/tools/resolve-provider-context.ts
apps/gateway/src/fallback.spec.ts
apps/gateway/src/lib/cached-queries.ts

✅ Files skipped from review due to trivial changes (1)

apps/gateway/src/fallback.spec.ts

# Conflicts: # apps/gateway/src/chat/tools/get-finish-reason-from-error.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3e208ad8a5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-01T08:50:49Z

+			where final_retry_log.id = ${log.retriedByLogId}
+				and final_retry_log.used_provider = ${log.usedProvider}


Require successful retry before excluding failed logs

The new exclusion predicate drops any failed log that has a retriedByLogId pointing to a same-provider attempt, but it never checks whether that retry actually succeeded. In gateway logging, retriedByLogId is assigned when a retry is attempted, and the final retry log can still end with hasError = true (for example, if all retries fail or the final attempt is an empty/error result). In those cases this filter still removes the original failure, which inflates model/provider uptime and undercounts errors. Add a success condition on final_retry_log (e.g., has_error = false) before excluding.

Useful? React with 👍 / 👎.

Same-provider retry exclusion only checked the final retry log existed on the same provider/region, not that it succeeded. When all attempts on the same provider failed, the original failure was wrongly dropped from mapping history, hiding upstream errors in uptime stats. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2cb5875348

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-04T07:55:20Z

+	selectionScope?: string,
 ): void {
-	const healthKey = getTrackedHealthKey(keyId);
+	const healthKey = getTrackedHealthKey(keyId, selectionScope);


Keep auth-key blacklisting global across scopes

This now keys permanent auth failures by selectionScope, so a key that gets blacklisted after a 401/403 (or invalid-key payload) for one model is still considered healthy for other models. In multi-model traffic, the same invalid credential will be re-selected and fail again per scope, causing avoidable request failures and retries even though auth validity is provider-wide, not model-specific. The auth blacklist path should either write to an unscoped key or is*Healthy should also honor an unscoped blacklist.

Useful? React with 👍 / 👎.

Permanent auth blacklists were keyed by selectionScope, so a 401/403 or invalid-key payload on one model still let the same credential be picked for other models, causing avoidable failures and retries. Auth validity is provider-wide, so the permanent blacklist is now also recorded on an unscoped record and consulted by every scope's health check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 551fea0f8f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-04T08:20:10Z

 		uptime: calculateUptime(health, now),
 		totalRequests: health.history.length,
 		consecutiveErrors: health.consecutiveErrors,
-		permanentlyBlacklisted: health.permanentlyBlacklisted,
+		permanentlyBlacklisted:


Report zero uptime for globally blacklisted keys

When a scoped auth failure triggers markUnscopedPermanentBlacklist, later metric reads can still report high uptime because this branch always uses calculateUptime(health, now) whenever a scoped health record exists. In practice, getTrackedKeyMetrics (and the analogous getKeyMetrics) can return permanentlyBlacklisted: true with uptime: 100 for other scopes or the unscoped record, which corrupts health reporting and any consumers that rely on uptime values for diagnostics or ranking.

Useful? React with 👍 / 👎.

When an unscoped auth blacklist was set after a scope already had a healthy history, getKeyMetrics/getTrackedKeyMetrics returned the prior uptime alongside permanentlyBlacklisted=true, corrupting health reporting and any consumer that ranks by uptime. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3808c0e3e1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-04T11:00:35Z

+				and final_retry_log.used_provider = ${log.usedProvider}
+				and coalesce(final_retry_log.has_error, false) = false


Preserve cross-region failures in model history

excludeRecoveredSameProviderRetry() removes failed attempts whenever a same-provider retry succeeds, but it does not check whether the retry stayed in the same region. calculateModelHistoryForMinute now uses this predicate, so a failure in singapore that recovers in cn-beijing is dropped from model-level logs/errors, inflating model uptime and making model history inconsistent with the mapping-level regional stats (which correctly keep cross-region failures). Add a region-match condition (like the mapping query uses) before excluding the failed row.

Useful? React with 👍 / 👎.

# Conflicts: # apps/gateway/src/test-utils/mock-openai-server.ts

- Pass stable model identifier (not requestId) as selectionScope to findProviderKey/getProviderEnv in moderations and videos paths. Per-request scopes were causing unbounded keyHealthMap growth and defeating failover. - Update envVarName/configIndex (and clear providerKey) when usedToken is replaced by a region-specific env var, so health is attributed to the credential actually sent. - Match invalid_api_key code and 'Incorrect API key provided' wording in provider-auth-errors so structured payloads also rotate keys. - Aggregate model history with the region-aware retry exclusion so cross-region recoveries no longer mask same-region failures at the model level. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 403759bc35

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-04T20:54:13Z

+		const envResult = getProviderEnv("openai", {
+			selectionScope: upstreamModel,
+		});


Pass moderation scope into health updates

This change scopes moderation key selection by upstreamModel, but the subsequent reportKeyError/reportKeySuccess and tracked-key report calls still omit that scope, so all non-auth failures are written to unscoped buckets that isKeyHealthy(..., upstreamModel) and isTrackedKeyHealthy(..., upstreamModel) do not consult for temporary blacklist logic. In practice, moderation requests can keep reselecting a repeatedly failing key for the same model because scoped reads never see those consecutive-error updates.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0e0ce4702c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-06T19:15:48Z

+					usedToken = regionToken;
+					envVarName = regionEnvVarName;
+					configIndex = 0;
+					providerKey = undefined;


Preserve BYOK endpoint context when attributing env health

When a BYOK project has a provider key with baseUrl or provider options and a regional env var is also configured, this clears providerKey after swapping to the regional token. The later getProviderEndpoint(...) call uses providerKey?.baseUrl, providerKey?.options, and providerKey !== undefined, so those requests stop using the BYOK endpoint/deployment/options and are routed as if they were credits/env traffic. Keep the provider key for endpoint construction and track the actual credential source separately for health attribution.

Useful? React with 👍 / 👎.

Clearing providerKey when a regional env var replaced usedToken stripped the BYOK base URL/options/Azure deployment context from the downstream endpoint resolution, so requests stopped using the user's configured endpoint. Track a separate trackedKeyHealthId for tracked-key health attribution instead, and keep providerKey for endpoint construction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector · 2026-05-07T14:31:02Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

## What changed This PR scopes routing health and retry decisions to the unit they actually describe (model + region + key) and recovers from a class of provider failures that previously slipped past retry. - **Retry on invalid-key 400 payloads.** Some providers surface invalid credentials with status `400` and a body like `API key not valid`. These are now classified as auth failures, so same-provider retry rotates to the next configured key and the bad credential is excluded for the rest of the request. Centralized in a new `provider-auth-errors.ts`. - **Per-model API key health.** Tracked-key health and env-based round-robin penalties are now scoped per `(key, model)` instead of per key. A bad key on `gpt-4` no longer fences `claude-3-5-sonnet` traffic on the same key. - **Region-consistent direct-provider key selection.** The two early region-lock lookups in `chat.ts` now go through the same model-scoped `findProviderKey(...)` path used at request time, so `usedRegion ??=` cannot pin one key's region to another key's token after failover. - **Same-provider, same-region retry aggregation.** Worker mapping aggregation now suppresses a failed attempt from `model_provider_mapping_history` only when the retry recovers on the same provider AND same region. A `singapore` failure recovered by `cn-beijing` is no longer dropped from `singapore`'s mapping uptime. - **Spec hardening.** Mock servers wait until they're listening before tests issue requests; touched gateway specs are less flaky under local and CI timing variance. ## Impact - Explicit-provider requests recover from invalid-key payloads by retrying another configured key for the same provider. - Key selection and uptime penalties are isolated per model. - Region-aware routing stays consistent with the scoped key actually selected after failover. - Provider mapping uptime no longer hides failed regional attempts when the recovery happens in a different region. ## Main areas changed - `apps/gateway/src/chat/chat.ts` - `apps/gateway/src/chat/tools/get-finish-reason-from-error.ts` - `apps/gateway/src/chat/tools/get-provider-env.ts` - `apps/gateway/src/chat/tools/resolve-provider-context.ts` - `apps/gateway/src/chat/tools/retry-with-fallback.ts` - `apps/gateway/src/lib/api-key-health.ts` - `apps/gateway/src/lib/cached-queries.ts` - `apps/gateway/src/lib/provider-auth-errors.ts` (new) - `apps/gateway/src/lib/round-robin-env.ts` - `apps/worker/src/services/stats-calculator.ts` - Gateway and worker specs covering invalid-key retries, model-scoped key health, region-scoped direct-provider routing, and retry aggregation Diff against `origin/main`: ~854 insertions / ~79 deletions across 21 files. ## Validation - `pnpm vitest run apps/gateway/src/fallback.spec.ts apps/gateway/src/chat/tools/retry-with-fallback.spec.ts apps/worker/src/services/stats-calculator.spec.ts --no-file-parallelism` - `pnpm format`, `pnpm lint`, `pnpm build` - `pnpm test:unit` — all PR-related tests pass  ## Summary by CodeRabbit * **New Features** * Key-health and token selection can be scoped by model, isolating selection per model/region. * New credential-detection and retry helper to better identify invalid provider credentials. * **Bug Fixes** * Improved alternate-key retry logic for authentication and credential-error cases. * More reliable regional failover and routing with model-scoped key health and logging. * **Tests** * Expanded unit/integration tests for key-health, failover, retry, routing, and shutdown behavior.  --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

steebchen changed the title ~~fix: retry invalid api key payloads~~ fix: isolate key and provider health Apr 6, 2026

steebchen added 3 commits April 7, 2026 14:36

fix: retry invalid api key payloads

b8579f3

fix: isolate key and provider health

8391b63

test: harden flaky gateway specs

b49cce3

steebchen force-pushed the fix/retry-invalid-api-key-payloads branch from 99f947c to b49cce3 Compare April 7, 2026 07:52

Merge remote-tracking branch 'origin/main' into fix/retry-invalid-api…

b45c134

…-key-payloads

steebchen marked this pull request as ready for review April 7, 2026 19:15

Copilot AI review requested due to automatic review settings April 7, 2026 19:15

Copilot started reviewing on behalf of steebchen April 7, 2026 19:16 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 7, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 7, 2026

View reviewed changes

Comment thread apps/gateway/src/chat/chat.ts

Comment thread apps/gateway/src/lib/provider-auth-errors.ts

Comment thread apps/gateway/src/test-utils/mock-openai-server.ts Outdated

fix: tighten retry health scoping

86a36d2

steebchen changed the title ~~fix: isolate key and provider health~~ fix: scope retry health by model and region Apr 11, 2026

coderabbitai Bot reviewed Apr 11, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 11, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into pr-1977-review

d48275a

steebchen and others added 2 commits April 28, 2026 01:03

Merge remote-tracking branch 'origin/main' into pr-1977-review

f4de254

# Conflicts: # apps/gateway/src/lib/cached-queries.ts

test: rename together.ai to together-ai in retry tests

90b4a5f

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread apps/gateway/src/chat/chat.ts

steebchen self-assigned this Apr 30, 2026

Merge remote-tracking branch 'origin/main' into pr-1977-review

3e208ad

# Conflicts: # apps/gateway/src/chat/tools/get-finish-reason-from-error.ts

chatgpt-codex-connector Bot reviewed May 1, 2026

View reviewed changes

steebchen and others added 2 commits May 4, 2026 14:31

Merge remote-tracking branch 'origin/main' into pr-1977-review

17e7932

chatgpt-codex-connector Bot reviewed May 4, 2026

View reviewed changes

steebchen and others added 2 commits May 5, 2026 03:15

Merge remote-tracking branch 'origin/main' into pr-1977-review

8670ffa

# Conflicts: # apps/gateway/src/test-utils/mock-openai-server.ts

chatgpt-codex-connector Bot reviewed May 4, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into pr-1977-review

0e0ce47

chatgpt-codex-connector Bot reviewed May 6, 2026

View reviewed changes

steebchen and others added 2 commits May 7, 2026 18:44

Merge branch 'main' into fix/retry-invalid-api-key-payloads

81ef46b

steebchen enabled auto-merge May 7, 2026 14:30

steebchen added this pull request to the merge queue May 7, 2026

Merged via the queue into main with commit 79f9329 May 7, 2026
16 of 17 checks passed

steebchen deleted the fix/retry-invalid-api-key-payloads branch May 7, 2026 14:48

		where final_retry_log.id = ${log.retriedByLogId}
		and final_retry_log.used_provider = ${log.usedProvider}

		and final_retry_log.used_provider = ${log.usedProvider}
		and coalesce(final_retry_log.has_error, false) = false

Conversation

steebchen commented Apr 6, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Impact

Main areas changed

Validation

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented Apr 20, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 27, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

steebchen commented Apr 6, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 6, 2026 •

edited

Loading