
feat: forward openai prompt cache controls #2201

Merged
steebchen merged 2 commits into theopenco:main from RATCHAW:feat/openai-prompt-cache-controls
May 7, 2026

Conversation

@RATCHAW
Contributor

@RATCHAW RATCHAW commented May 7, 2026

Summary

Expose OpenAI prompt-cache request parameters (prompt_cache_key, prompt_cache_retention) on /v1/chat/completions and /v1/responses, and forward them to OpenAI per the prompt caching docs.

Scope note

This is not a cache-write billing change.

OpenAI does not surcharge cache writes — first-write tokens cost the same as regular input tokens, and OpenAI’s response only reports cache reads (prompt_tokens_details.cached_tokens), which the gateway already extracts and bills correctly in HEAD.

What’s missing without this PR is the ability for clients to pass OpenAI’s two opt-in cache control hints:

  • prompt_cache_key: routing hint that improves cache hit rate for requests sharing a long common prefix
  • prompt_cache_retention: "in_memory" (default, ~5–10 min) or "24h" extended retention on eligible models

Before this change, sending either field returned a schema error.
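
With this PR both hints pass schema validation and are forwarded upstream. A minimal illustration of such a request (the base URL, API key variable, and field values are placeholders, not taken from the repository):

```ts
// Illustrative only: gateway host and env var name are placeholders.
const response = await fetch("https://<your-gateway-host>/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.LLM_GATEWAY_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4.1",
    messages: [{ role: "user", content: "Summarize the attached policy document." }],
    // Routing hint: requests sharing this key and a long common prefix
    // are more likely to land on the same cache.
    prompt_cache_key: "tenant-a",
    // Extended retention; only honored on eligible models, otherwise the
    // gateway strips it and the default in_memory lifetime applies.
    prompt_cache_retention: "24h",
  }),
});
```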

Changes

  • Schemas
    • Accept both fields on /v1/chat/completions (apps/gateway/src/chat/schemas/completions.ts)
    • Accept both fields on /v1/responses (apps/gateway/src/responses/schemas.ts)
  • Types
    • Add PromptCacheRetention plus the two fields to:
      • OpenAIRequestBody
      • OpenAIResponsesRequestBody
      • (packages/models/src/types.ts)
    • Bring RequestBodyPreparer type alias in sync with the existing prepareRequestBody impl signature
  • Wiring
    • Thread the params through chat.ts → resolve-provider-context.ts → prepare-request-body.ts
    • Preserve them across Responses → Chat conversion in responses.ts
  • Forwarding
    • packages/actions/src/prepare-request-body.ts injects both fields into the upstream OpenAI request body when usedProvider === "openai"
    • Azure is intentionally excluded
  • Docs-validated retention enforcement
    • Add supportsOpenAIExtendedPromptCache(modelName) helper (packages/models/src/helpers.ts)
    • If prompt_cache_retention="24h" is sent for a model that doesn’t support extended retention, the gateway silently strips it (graceful degradation; the request still succeeds with the default in_memory retention); see the sketch after this list
    • Eligibility list comes directly from the OpenAI docs:
      • gpt-4.1
      • gpt-5, gpt-5-codex
      • gpt-5.1, gpt-5.1-codex, gpt-5.1-codex-mini, gpt-5.1-codex-max, gpt-5.1-chat-latest
      • gpt-5.2, gpt-5.4, gpt-5.5, gpt-5.5-pro
  • Cost test
    • Assert cacheWriteInputCost stays 0 for OpenAI even when cacheWriteTokens are present (no cache-write surcharge per OpenAI’s pricing model)
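
A minimal sketch of the retention-eligibility check and the graceful strip described above (supportsOpenAIExtendedPromptCache and the model list come from this PR's description; resolvePromptCacheRetention is a hypothetical helper shown only to illustrate the behavior, not the merged code):

```ts
// Sketch only; mirrors the behavior described in the bullet list above.
type PromptCacheRetention = "in_memory" | "24h";

// Eligibility list taken from the OpenAI docs, per the PR description.
const EXTENDED_PROMPT_CACHE_MODELS = new Set([
  "gpt-4.1",
  "gpt-5",
  "gpt-5-codex",
  "gpt-5.1",
  "gpt-5.1-codex",
  "gpt-5.1-codex-mini",
  "gpt-5.1-codex-max",
  "gpt-5.1-chat-latest",
  "gpt-5.2",
  "gpt-5.4",
  "gpt-5.5",
  "gpt-5.5-pro",
]);

export function supportsOpenAIExtendedPromptCache(modelName: string): boolean {
  return EXTENDED_PROMPT_CACHE_MODELS.has(modelName);
}

// Hypothetical helper: "24h" on an ineligible model degrades to the default
// in_memory lifetime instead of failing the request. Applied only when the
// upstream provider is OpenAI proper (Azure is excluded).
export function resolvePromptCacheRetention(
  modelName: string,
  retention?: PromptCacheRetention,
): PromptCacheRetention | undefined {
  if (retention === "24h" && !supportsOpenAIExtendedPromptCache(modelName)) {
    return undefined; // silently stripped
  }
  return retention;
}
```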

Tests

  • Unit
    • 3 new tests in packages/actions/src/prepare-request-body.spec.ts:
      • forwarding
      • Azure exclusion
      • stripping 24h on unsupported models
    • 1 new test in apps/gateway/src/lib/costs.spec.ts confirming no cache-write fee for OpenAI
  • E2E
    • New suite apps/gateway/src/native-openai-cache.e2e.ts (gated on LLM_OPENAI_API_KEY):
      • chat-completions cache hit
      • Responses API cache hit
      • 24h strip
      • 24h forward

Out of scope

The original “bill cache writes” issue is not addressed by this PR for OpenAI (see scope note). OpenAI’s read-side cache billing already works correctly in HEAD via the existing cachedInputPrice field on each model and the cached_tokens extraction in extract-token-usage.ts.
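
For context, read-side cache usage arrives in OpenAI's standard usage payload. A rough sketch of that extraction (the usage shape follows OpenAI's documented response; the helper name here is illustrative, the real logic lives in extract-token-usage.ts):

```ts
// Illustrative extraction of cached prompt tokens from a chat completion response.
interface OpenAIUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

function extractCachedTokens(usage: OpenAIUsage | undefined): number {
  // Only cache reads are reported; OpenAI neither reports nor surcharges cache writes.
  return usage?.prompt_tokens_details?.cached_tokens ?? 0;
}
```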

Summary by CodeRabbit

  • New Features

    • Added OpenAI prompt caching support to chat completions and Responses API with configurable prompt_cache_key and prompt_cache_retention parameters
    • Implemented model-aware cache retention handling for extended caching capabilities
  • Tests

    • Added end-to-end test coverage validating cache key acceptance, cached token detection, and model-specific retention behavior

Accept `prompt_cache_key` and `prompt_cache_retention` on `/v1/chat/completions` and `/v1/responses`, and forward them to OpenAI per the prompt caching docs.

Adds a docs-derived eligibility list for `prompt_cache_retention="24h"` (gpt-4.1, gpt-5.x family). When the param is sent on a model that only supports `in_memory` retention, it is silently stripped so the request still succeeds with the default cache lifetime.

Not a billing change: OpenAI does not surcharge cache writes, and read-side billing via `cachedInputPrice` already works in `HEAD`. A regression test asserts `cacheWriteInputCost` stays at 0 for OpenAI.
@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026


Walkthrough

This PR adds OpenAI prompt caching support to the gateway by threading two parameters—prompt_cache_key and prompt_cache_retention—through request schemas, body preparation logic, gateway routing, and retry handlers. It includes model-aware logic to strip unsupported extended retention values and comprehensive test coverage validating end-to-end behavior.

Changes

Prompt Caching Pipeline

  • Type Definitions and Model Support (packages/models/src/types.ts, packages/models/src/helpers.ts)
    Introduces the PromptCacheRetention union type (in_memory | 24h), extends the OpenAIRequestBody and OpenAIResponsesRequestBody interfaces with optional cache fields, updates the RequestBodyPreparer signature, and exports the supportsOpenAIExtendedPromptCache() helper with a model allowlist.
  • Request Schema Validation (apps/gateway/src/chat/schemas/completions.ts, apps/gateway/src/responses/schemas.ts, apps/gateway/src/responses/tools/convert-chat-to-responses.ts, apps/gateway/src/chat/tools/resolve-provider-context.ts)
    Adds prompt_cache_key (nullable string) and prompt_cache_retention (nullable enum) to the completions and responses schemas with null-to-undefined transforms; updates the ProviderContextOptions type and ResponsesEchoRequest interface.
  • Request Body Preparation (packages/actions/src/prepare-request-body.ts)
    Extends the function signature to accept prompt_cache_key and prompt_cache_retention; applies model-aware logic in both the OpenAI Responses API and chat completions branches: forwards both params when the model supports extended cache, and strips 24h retention for unsupported models while preserving in_memory.
  • Gateway Request Threading (apps/gateway/src/chat/chat.ts, apps/gateway/src/responses/responses.ts, apps/gateway/src/chat/tools/resolve-provider-context.ts)
    Extracts cache parameters from validated requests, includes them in cachePayload for proper cache key segmentation, passes them through resolveProviderContext for retry handling, and forwards them through the Responses API handler to the upstream provider.
  • Testing and Validation (packages/actions/src/prepare-request-body.spec.ts, apps/gateway/src/lib/costs.spec.ts, apps/gateway/src/native-openai-cache.e2e.ts)
    Unit tests validate parameter forwarding with model-aware retention and verify no extra cache-write fee for OpenAI; end-to-end tests confirm cache key acceptance, cached token detection across both chat and Responses endpoints, and retention parameter handling on unsupported models.
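
As a rough illustration of the "Request Schema Validation" row above, the two optional fields with a null-to-undefined transform could be expressed like this (Zod-style sketch; the exact schema structure in the repo may differ):

```ts
import { z } from "zod";

// Sketch of the two opt-in cache fields added to the completions/responses schemas.
const promptCacheFields = {
  prompt_cache_key: z
    .string()
    .nullable()
    .optional()
    .transform((value) => value ?? undefined),
  prompt_cache_retention: z
    .enum(["in_memory", "24h"])
    .nullable()
    .optional()
    .transform((value) => value ?? undefined),
};
```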

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • theopenco/llmgateway#2193: Both PRs extend prepareRequestBody to thread provider-specific caching metadata (OpenAI prompt cache vs Bedrock cache/TTL).
  • theopenco/llmgateway#2101: Both PRs modify the same prepare-request-body.ts function with provider-aware parameter handling.
  • theopenco/llmgateway#2059: Both PRs modify prepareRequestBody and resolveProviderContext for provider request customization.

Suggested reviewers

  • steebchen
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 44.44%, which is insufficient. The required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title 'feat: forward openai prompt cache controls' accurately summarizes the main change: exposing and forwarding OpenAI prompt caching parameters through gateway endpoints.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/gateway/src/native-openai-cache.e2e.ts`:
- Around line 28-31: Replace the usages of `any` in the send parameter, return
type and `last` variable with a minimal typed response shape (e.g., declare an
interface like CacheResponse { status: number; json: Record<string, unknown> |
null } or a narrower shape if you know fields) and use that type for the `send:
() => Promise<CacheResponse>`, the function's Promise return `{ status: number;
json: CacheResponse['json']; attempts: number }`, and `let last: CacheResponse`;
update any downstream assertions to narrow `json` before accessing properties.
This targets the `send` parameter type, the function return type, and the `last`
variable in the native-openai-cache.e2e test.
- Line 51: The module-scoped constant hasOpenAIKey caches
process.env.LLM_OPENAI_API_KEY which violates the rule against caching env
reads; remove hasOpenAIKey and instead read process.env.LLM_OPENAI_API_KEY at
the point of use (or wrap it in a tiny function like getHasOpenAIKey() used only
where needed) so the env is read on demand rather than at module load; update
any references to hasOpenAIKey to use process.env.LLM_OPENAI_API_KEY (or the new
getter) and ensure behavior remains the same in the tests that relied on
hasOpenAIKey.
- Line 53: The e2e suite declaration describe("e2e openai prompt cache",
getConcurrentTestOptions(), () => { ... }) enables concurrent execution; remove
the concurrent options so the suite runs serially — locate the describe call for
"e2e openai prompt cache" and delete the getConcurrentTestOptions() argument (or
replace it with no options) so the test file is not run in parallel per
repository e2e guidelines.

In `@packages/actions/src/prepare-request-body.spec.ts`:
- Around line 280-337: Tests are weakening type-safety by casting the response
of prepareOpenAITextRequest to "any"; instead remove the "as any" casts and
assert the concrete shape using proper typing or runtime type guards (e.g.,
define/ reuse an interface for the expected OpenAI request body or use
ReturnType<typeof prepareOpenAITextRequest> and narrow with type guards) and
then check requestBody.prompt_cache_key and requestBody.prompt_cache_retention
directly; update the test files to import/declare the expected request body type
or add minimal type guards around requestBody so the assertions remain strongly
typed without using "as any".
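
For the first comment, a typed response shape along the lines it suggests might look like this (the CacheResponse name comes from the comment itself; the retry-wrapper signature is an assumption about the test helper, not the actual e2e code):

```ts
// Hypothetical typed shape for the e2e helper, replacing `any`.
interface CacheResponse {
  status: number;
  json: Record<string, unknown> | null;
}

async function sendWithRetries(
  send: () => Promise<CacheResponse>,
  maxAttempts = 3,
): Promise<{ status: number; json: CacheResponse["json"]; attempts: number }> {
  let last: CacheResponse = await send();
  let attempts = 1;
  while (attempts < maxAttempts && last.status !== 200) {
    last = await send();
    attempts++;
  }
  // Downstream assertions should narrow `json` before reading properties.
  return { status: last.status, json: last.json, attempts };
}
```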

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5e1386e1-d11c-46cd-8eee-88df5a104da6

📥 Commits

Reviewing files that changed from the base of the PR and between 177c590 and f91d25a.

📒 Files selected for processing (12)
  • apps/gateway/src/chat/chat.ts
  • apps/gateway/src/chat/schemas/completions.ts
  • apps/gateway/src/chat/tools/resolve-provider-context.ts
  • apps/gateway/src/lib/costs.spec.ts
  • apps/gateway/src/native-openai-cache.e2e.ts
  • apps/gateway/src/responses/responses.ts
  • apps/gateway/src/responses/schemas.ts
  • apps/gateway/src/responses/tools/convert-chat-to-responses.ts
  • packages/actions/src/prepare-request-body.spec.ts
  • packages/actions/src/prepare-request-body.ts
  • packages/models/src/helpers.ts
  • packages/models/src/types.ts

Comment thread: apps/gateway/src/native-openai-cache.e2e.ts (Outdated)

const hasOpenAIKey = !!process.env.LLM_OPENAI_API_KEY;

describe("e2e openai prompt cache", getConcurrentTestOptions(), () => {

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disable parallel execution for this e2e suite.

Line 53 uses getConcurrentTestOptions() in an .e2e.ts suite, which can introduce flaky cache/rate-limit behavior and conflicts with repo policy for e2e execution.

Suggested change
-describe("e2e openai prompt cache", getConcurrentTestOptions(), () => {
+describe("e2e openai prompt cache", getTestOptions(), () => {

As per coding guidelines, "**/*.{spec.ts,e2e.ts}: Do not run test files or suites in parallel unless the repository instructions for that exact suite explicitly require it".


Comment on lines +280 to +337
  const requestBody = (await prepareOpenAITextRequest({
    promptCacheKey: "tenant-a",
    promptCacheRetention: "24h",
  })) as any;

  expect(requestBody.prompt_cache_key).toBe("tenant-a");
  expect(requestBody.prompt_cache_retention).toBe("24h");
});

test("should forward prompt cache controls to OpenAI Responses API", async () => {
  const requestBody = (await prepareOpenAITextRequest({
    useResponsesApi: true,
    promptCacheKey: "tenant-a",
    promptCacheRetention: "in_memory",
  })) as any;

  expect(requestBody.prompt_cache_key).toBe("tenant-a");
  expect(requestBody.prompt_cache_retention).toBe("in_memory");
});

test("should not forward OpenAI prompt cache controls to Azure", async () => {
  const requestBody = (await prepareOpenAITextRequest({
    provider: "azure",
    promptCacheKey: "tenant-a",
    promptCacheRetention: "24h",
  })) as any;

  expect(requestBody.prompt_cache_key).toBeUndefined();
  expect(requestBody.prompt_cache_retention).toBeUndefined();
});

test("should strip prompt_cache_retention=24h on models that don't support extended retention", async () => {
  const requestBody = (await prepareOpenAITextRequest({
    model: "gpt-4o",
    promptCacheKey: "tenant-a",
    promptCacheRetention: "24h",
  })) as any;

  expect(requestBody.prompt_cache_key).toBe("tenant-a");
  expect(requestBody.prompt_cache_retention).toBeUndefined();
});

test("should still forward prompt_cache_retention=in_memory on models without 24h support", async () => {
  const requestBody = (await prepareOpenAITextRequest({
    model: "gpt-4o",
    promptCacheRetention: "in_memory",
  })) as any;

  expect(requestBody.prompt_cache_retention).toBe("in_memory");
});

test("should forward prompt_cache_retention=24h on models that do support extended retention", async () => {
  const requestBody = (await prepareOpenAITextRequest({
    model: "gpt-4.1",
    promptCacheRetention: "24h",
  })) as any;

  expect(requestBody.prompt_cache_retention).toBe("24h");

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Avoid as any in the new prompt-caching tests.

The new assertions use as any for fields that can be checked with proper unions/type guards, which weakens test type-safety unnecessarily.

As per coding guidelines, **/*.{ts,tsx}: "Never use any type or force cast with as any unless absolutely necessary."
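
One way to keep those assertions typed without `as any` (an illustrative sketch; the interface and helper below are not the shapes actually used in the spec):

```ts
// Narrow the prepared body to just the cache fields under test.
interface OpenAICacheControls {
  prompt_cache_key?: string;
  prompt_cache_retention?: "in_memory" | "24h";
}

function asCacheControls(body: unknown): OpenAICacheControls {
  if (typeof body !== "object" || body === null) {
    throw new Error("expected prepared request body to be an object");
  }
  return body as OpenAICacheControls;
}

// Usage in a test:
// const requestBody = asCacheControls(await prepareOpenAITextRequest({ ... }));
// expect(requestBody.prompt_cache_key).toBe("tenant-a");
```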


OpenAI cache hits and DB billing are already covered by `chat-prompt-caching.e2e.ts` (it iterates over every model with `cachedInputPrice`). The `24h` strip/forward gateway logic is covered deterministically by unit tests in `prepare-request-body.spec.ts`, so the dedicated e2e file added no unique coverage.
@steebchen steebchen enabled auto-merge May 7, 2026 18:07
@steebchen steebchen added this pull request to the merge queue May 7, 2026
Merged via the queue into theopenco:main with commit 0952689 May 7, 2026
10 checks passed