
fix(responses): translate content parts in multi-step executor input #1161

Merged
JoshC8C7 merged 2 commits into main from josh/responses-content-part-translation
Jun 22, 2026

@JoshC8C7

Contributor

Summary

Multi-step /ai/v1/responses requests fail with an opaque executor_error
(stored as a 500) whenever a message carries array-form content parts, for
example {"type":"input_text","text":"..."}, which is exactly what the
OpenAI Responses SDKs emit. This fixes the executor's input translator to
rewrite content parts into chat-completions shape.

Root cause

When a /responses request includes tools, the multi-step executor runs.
Each model-call step is fired as a loopback HTTP request at the gateway's own
/ai/v1/chat/completions endpoint, which re-enters onwards' strict handler.

responses/transition.rs::translate_input_items built that body but, for a
message item, copied every field except the top-level type discriminator
and left the nested content array untouched. So a Responses message:

{ "role": "user", "content": [ { "type": "input_text", "text": "hi" } ] }

was forwarded with input_text intact. The loopback handler deserializes
with Json<ChatCompletionRequest>, whose ContentPart enum only accepts
text and image_url, so the typed extractor rejected the request with a
422 (empty body) before any provider call.

Two things made this specific:

  1. String-form content ("content": "hi") is valid in both schemas and
    round-tripped fine. Only array-form typed parts broke.
  2. The first-turn direct passthrough to the provider succeeds; only the
    decomposed loopback /chat/completions call, taken once tools force the
    executor path, rejects it.
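
The rejection described above can be illustrated with a dependency-free sketch. It is not the onwards code: `ChatPartKind` and `parse_chat_part_kind` are hypothetical names that stand in for a typed `ContentPart` enum with a closed set of discriminators, which is all that is needed to see why `input_text` fails while `text` succeeds.

```rust
/// Hypothetical stand-in for the chat-completions `ContentPart` enum:
/// a closed set of part types, as described in the root-cause section.
#[derive(Debug, PartialEq)]
enum ChatPartKind {
    Text,
    ImageUrl,
}

/// Strictly parse the `type` discriminator of a content part.
/// Anything outside the closed set is an error, which is how a typed
/// deserializer behaves when it meets a Responses-only tag.
fn parse_chat_part_kind(tag: &str) -> Result<ChatPartKind, String> {
    match tag {
        "text" => Ok(ChatPartKind::Text),
        "image_url" => Ok(ChatPartKind::ImageUrl),
        other => Err(format!("unknown content part type `{}`", other)),
    }
}
```

Under this model, `parse_chat_part_kind("input_text")` returns an `Err`, which corresponds to the loopback handler rejecting the body before any provider call.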

onwards already has the correct mapping in strict/adapter.rs
(convert_message_content); the executor reinvented a partial translator
that skipped it. This fix mirrors that mapping at the JSON layer the executor
operates on.

Change

dwctl/src/responses/transition.rs:

  • The message branch now routes content through a new
    translate_message_content helper instead of copying it verbatim.
  • translate_message_content: string content passes through; an array is
    rewritten part-by-part; a message whose parts all drop collapses to
    empty-string content.
  • translate_content_part: maps input_text/output_text to text,
    input_image to image_url (with optional detail), refusal to text.
    Already chat-shaped text/image_url pass through. input_file and
    unknown types are dropped with a trace.
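
A minimal sketch of the mapping listed above, assuming plain enums in place of the `serde_json::Value` layer the real helpers operate on. The function names follow the PR, but every type here (`ResponsesPart`, `ChatPart`, `ResponsesContent`, `ChatContent`) is a hypothetical model introduced for illustration, and tracing of dropped parts is omitted.

```rust
/// Hypothetical model of a Responses-side content part.
#[derive(Debug, Clone, PartialEq)]
enum ResponsesPart {
    InputText(String),
    OutputText(String),
    InputImage { image_url: Option<String>, detail: Option<String> },
    Refusal(String),
    InputFile,
    /// A part that is already chat-shaped (`text` / `image_url`).
    Chat(ChatPart),
    Unknown(String),
}

/// Hypothetical model of a chat-completions content part.
#[derive(Debug, Clone, PartialEq)]
enum ChatPart {
    Text(String),
    ImageUrl { url: String, detail: Option<String> },
}

#[derive(Debug, Clone, PartialEq)]
enum ResponsesContent {
    Text(String),
    Parts(Vec<ResponsesPart>),
}

#[derive(Debug, Clone, PartialEq)]
enum ChatContent {
    Text(String),
    Parts(Vec<ChatPart>),
}

/// Map one part; `None` means "no chat-completions representation, drop it".
fn translate_content_part(part: &ResponsesPart) -> Option<ChatPart> {
    match part {
        ResponsesPart::InputText(t)
        | ResponsesPart::OutputText(t)
        | ResponsesPart::Refusal(t) => Some(ChatPart::Text(t.clone())),
        ResponsesPart::InputImage { image_url, detail } => {
            // A missing URL drops the part.
            let url = image_url.clone()?;
            Some(ChatPart::ImageUrl { url, detail: detail.clone() })
        }
        ResponsesPart::Chat(p) => Some(p.clone()),
        ResponsesPart::InputFile | ResponsesPart::Unknown(_) => None,
    }
}

/// String content passes through; arrays are rewritten part by part.
fn translate_message_content(content: &ResponsesContent) -> ChatContent {
    match content {
        ResponsesContent::Text(s) => ChatContent::Text(s.clone()),
        ResponsesContent::Parts(parts) => {
            let translated: Vec<ChatPart> =
                parts.iter().filter_map(translate_content_part).collect();
            if translated.is_empty() {
                // All parts dropped: collapse to empty-string content.
                // (In this sketch an empty input array collapses too; the PR
                // does not say whether the real helper behaves that way.)
                ChatContent::Text(String::new())
            } else {
                ChatContent::Parts(translated)
            }
        }
    }
}
```

Returning `Option` from the per-part function lets the caller translate and drop in a single `filter_map` pass, which is the shape the diff snippets further down also show.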

Tests

Added unit tests in transition.rs (all via parse_parent_request):

  • translates_input_text_content_parts_to_chat_text (production repro)
  • translates_output_text_content_parts_to_chat_text
  • passes_string_content_through_unchanged
  • translates_input_image_content_part_to_image_url
  • mixed_content_parts_keep_representable_drop_rest
  • all_unrepresentable_content_parts_collapse_to_empty_string

Run with just test rust.

Scope

Behavior change is limited to array-form message content on the executor
path. String content and the direct passthrough path are untouched.

🤖 Generated with Claude Code

Multi-step /v1/responses requests fail with an opaque executor_error
(stored as 500) whenever a message carries array-form content parts such as
{"type":"input_text","text":"..."}, which is what the OpenAI Responses
SDKs emit.

When tools are present, the executor decomposes the request into loopback
/v1/chat/completions calls. translate_input_items copied each message's
content verbatim, leaving Responses-only part types (input_text, ...) in
place. The loopback handler deserializes into ChatCompletionRequest, whose
ContentPart enum only accepts text and image_url, so the typed Json extractor
rejected the body with a 422 (empty body) before reaching the provider.

Rewrite content parts into chat-completions shape (input_text/output_text to
text, input_image to image_url, refusal to text), mirroring onwards'
convert_message_content. String content and already-chat-shaped parts pass
through; input_file and unknown parts are dropped. Adds unit tests including
the production repro.
Copilot AI review requested due to automatic review settings June 18, 2026 18:17
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 18, 2026


Deploying control-layer with Cloudflare Pages

Latest commit: 682f01d
Status: ✅  Deploy successful!
Preview URL: https://da4a54eb.control-layer.pages.dev
Branch Preview URL: https://josh-responses-content-part.control-layer.pages.dev


Copilot AI left a comment

Contributor


Pull request overview

Fixes multi-step /ai/v1/responses execution failures when message content is provided as an array of typed Responses content parts (e.g. {"type":"input_text","text":"..."}), by translating those parts into chat-completions-compatible ContentPart shapes before issuing the loopback /ai/v1/chat/completions requests.

Changes:

  • Update the Responses input → chat-completions messages translator to rewrite message.content arrays via a new translate_message_content helper.
  • Add translate_content_part mapping for common Responses part types (input_text, output_text, input_image, refusal) and drop unsupported/unknown parts with logging.
  • Add unit tests covering array-form content translation, passthrough behavior for string content, and drop/collapse behavior for unsupported parts.

Comment on lines +321 to +325
/// Returns `None` for parts with no chat-completions representation
/// (`input_file`, unknown types); these are dropped with a trace rather than
/// forwarded, since the upstream schema would reject them. Already
/// chat-shaped parts (`text`, `image_url`) pass through unchanged so a client
/// that sent chat-completions content directly still works.

@doubleword-code doubleword-code Bot left a comment


Summary

This PR fixes a bug where multi-step /v1/responses requests with array-form content parts (e.g., {"type":"input_text","text":"..."}) would fail with a 422 error when tools were present. The root cause was that translate_input_items was copying content verbatim, leaving Responses-only part types (input_text, input_image, etc.) in place, which the loopback /v1/chat/completions handler's typed deserialization rejected.

The fix adds translate_message_content and translate_content_part functions to rewrite Responses content parts into Chat Completions shape, mirroring onwards' convert_message_content. The implementation is well-tested with 6 new unit tests covering the main scenarios.

Verdict: Approve with one non-blocking finding regarding input_image with file_id.

Research notes

  1. OpenAI Images and Vision documentation (https://platform.openai.com/docs/guides/images-vision): Confirmed the content part schemas:

    • Chat Completions: {"type": "image_url", "image_url": {"url": "..."}}
    • Responses API: {"type": "input_image", "image_url": "..."} (bare string, not nested)

    The docs also show that input_image can alternatively use file_id instead of image_url, which is a gap in the current implementation.

  2. walker.rs (dwctl/src/image_normalizer/walker.rs): Confirmed the two different shapes are already known in the codebase:

    • Lines 6-7: chat-completions shape with nested image_url.url
    • Lines 8-10: responses shape with bare image_url string

Suggested next steps

  1. Address Non-blocking finding about input_image with file_id - consider supporting this alternative format or documenting the limitation.
  2. Consider adding a test case for input_image with file_id to explicitly document the current behavior (dropping the part).

General findings

None - all other aspects of the implementation are correct and well-tested.

Some("input_image") => {
    // Responses carries the image as a bare `image_url` string;
    // chat-completions wraps it in an object with optional detail.
    let url = part.get("image_url").and_then(|u| u.as_str())?;


Non-blocking: The input_image translation only handles the image_url field, but according to OpenAI's Responses API documentation, input_image can also be expressed with file_id instead:

{"type": "input_image", "file_id": "file-abc123"}

When image_url is missing (e.g., when file_id is used), the ? operator returns None, silently dropping the content part. This may be acceptable for now if file_id-based images are rare in your use cases, but it's worth noting the gap.

Why it matters: Users sending input_image with file_id (uploaded via the Files API for vision) will have those image parts silently dropped, potentially breaking image-based workflows that rely on file uploads rather than URLs.

Suggested fix: Either:

  1. Add support for file_id by translating it to the chat-completions equivalent ({"type": "image_url", "image_url": {"file_id": "..."}}), or
  2. Add explicit logging when dropping input_image with file_id to make the limitation observable, or
  3. Document this limitation in the module-level docs

For reference, the OpenAI docs show both forms are valid:

// URL form
{"type": "input_image", "image_url": "https://..."}

// File ID form  
{"type": "input_image", "file_id": "file-abc123"}
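
One way to make the `file_id` gap observable (option 2 above) is to have the image translation report why a part was dropped instead of returning a bare `None`. The sketch below is an assumption-laden illustration, not the PR's code: `InputImage`, `DropReason`, and `translate_input_image` are hypothetical names, and the part is modelled as a plain struct rather than a `serde_json::Value`.

```rust
/// Hypothetical reasons an `input_image` part cannot be translated.
#[derive(Debug, PartialEq)]
enum DropReason {
    /// The part used `file_id`, which this sketch does not translate.
    FileIdUnsupported,
    /// Neither `image_url` nor `file_id` was present.
    MissingImageSource,
}

/// Hypothetical model of a Responses `input_image` part.
struct InputImage {
    image_url: Option<String>,
    file_id: Option<String>,
}

/// Return the URL to wrap in a chat-completions `image_url` object, or the
/// reason the part has to be dropped so the caller can log it explicitly.
fn translate_input_image(part: &InputImage) -> Result<String, DropReason> {
    match (&part.image_url, &part.file_id) {
        (Some(url), _) => Ok(url.clone()),
        (None, Some(_)) => Err(DropReason::FileIdUnsupported),
        (None, None) => Err(DropReason::MissingImageSource),
    }
}
```

A caller could log the `DropReason` and then convert with `.ok()` to keep the existing `filter_map` flow unchanged.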

@doubleword-code doubleword-code Bot left a comment


Summary

This PR fixes a bug where multi-step /v1/responses requests fail with a 500 error when messages contain array-form content parts (e.g., {"type":"input_text","text":"..."}). The fix translates Open Responses content part types (input_text, output_text, input_image, refusal) into their chat-completions equivalents (text, image_url) before the request reaches the loopback /v1/chat/completions handler.

Verdict: This is a well-implemented fix that addresses a real production issue. The code is thorough, well-tested, and follows existing patterns in the codebase. However, there is one potential edge case around annotation preservation that should be considered.

Research notes

I reviewed:

  • The diff showing changes to dwctl/src/responses/transition.rs
  • Related code in dwctl/src/image_normalizer/walker.rs which also handles Responses API content parts
  • The middleware and processor flow that uses this translation
  • Test coverage for the new functionality

The fix correctly mirrors the pattern used elsewhere in the codebase (e.g., walker.rs line 134) for detecting Responses API content part types. The comment at line 298 references onwards' typed convert_message_content (onwards/src/strict/adapter.rs) but since onwards is an external crate (version 0.33.1), I couldn't verify the exact implementation - however, the JSON-level approach taken here is appropriate for the translator layer.

Suggested next steps

  1. Non-blocking: Consider whether annotations on output_text parts should be preserved or dropped (see inline comment)
  2. Approve once any feedback is addressed

General findings

The implementation is solid:

  • String content passes through unchanged (correct - both schemas accept this)
  • Array content is translated part-by-part
  • Unrepresentable parts (input_file, unknown types) are dropped with appropriate logging
  • All-unrepresentable parts collapse to empty string rather than invalid empty array
  • Tests cover the production repro case and edge cases

One observation: the refusal part type conversion may need verification against actual upstream behavior - mapping refusal text to regular text content loses the semantic distinction, but this appears intentional given the comment about dropping unrepresentable parts.

General findings (auto-demoted from inline due to pre-validation)

  • Nit dwctl/src/responses/transition.rs:332 — Consistent with walker.rs line 138 which also accesses image_url as a bare string for Responses API shape. Good consistency across the codebase.
    • (demoted: code self-check failed at dwctl/src/responses/transition.rs:332: diff has Some("input_image") => {, model claimed let url = part.get("image_url").and_then(|u| u.as_str())?;)
  • Nit dwctl/src/responses/transition.rs:310 — Good edge case handling. Collapsing to empty string rather than empty array prevents a fresh validation error upstream. This matches the adapter behavior mentioned in the comment.
    • (demoted: code self-check failed at dwctl/src/responses/transition.rs:310: diff has // rather than an empty array, matching the adapter's behavior and keeping, model claimed // A message whose parts all dropped collapses to empty-string content)

/// that sent chat-completions content directly still works.
fn translate_content_part(part: &Value) -> Option<Value> {
    match part.get("type").and_then(|t| t.as_str()) {
        Some("input_text") | Some("output_text") => {


Non-blocking: Consider annotation preservation for output_text parts.

Why it matters: The test at line 840 shows an output_text part with "annotations":[] being passed. The current implementation extracts only the text field and drops any annotations array. While most upstreams may not use annotations, the OpenAI Responses API spec includes annotations for things like citations, and dropping them could lose metadata that clients expect to be preserved round-trip.

Suggested fix: If annotation preservation is desired, you could add the annotations field to the translated part:

Some("input_text") | Some("output_text") => {
    let text = part.get("text").and_then(|t| t.as_str()).unwrap_or_default();
    let mut result = serde_json::Map::new();
    result.insert("type".to_string(), json!("text"));
    result.insert("text".to_string(), json!(text));
    // Optionally preserve annotations if present
    if let Some(annotations) = part.get("annotations") {
        result.insert("annotations".to_string(), annotations.clone());
    }
    // Return the map built above so any preserved annotations are included
    Some(Value::Object(result))
}

However, note that chat-completions ContentPart may not accept annotations either - verify what the upstream schema expects before adding this.

        }
        Some(json!({"type": "image_url", "image_url": Value::Object(image_url)}))
    }
    Some("refusal") => {


Non-blocking: Verify that the refusal → text mapping is acceptable.

Why it matters: The refusal part type carries model refusal messages (typically when the model declines to answer due to safety policies). Mapping it to plain text loses the semantic distinction between normal response content and a refusal. Clients that inspect the response structure might expect to handle refusals differently.

That said, this appears intentional - the comment says parts with "no chat-completions representation" are dropped or mapped. Since chat-completions doesn't have a refusal part type, mapping to text is pragmatic. The alternative would be to drop it entirely (like input_file), but that would lose the refusal message content entirely, which is worse.

Suggested fix: No change needed if this behavior is intentional. Consider adding a comment noting that refusal semantics are lost in the translation if this is a known limitation.

let Value::Array(parts) = content else {
    return content.clone();
};
let translated: Vec<Value> = parts.iter().filter_map(translate_content_part).collect();


Nit: Clean use of filter_map to both translate and drop unrepresentable parts in one pass. This is idiomatic Rust and matches the pattern described in the doc comment.

// part types in place, which the loopback
// /v1/chat/completions handler's typed deserialization
// rejected with a 422.
"content" => {


Nit: Correct placement of the content translation hook. By handling this in the key-match within the message object iteration, you ensure all messages get their content translated regardless of role.

@JoshC8C7 JoshC8C7 merged commit ce50bd7 into main Jun 22, 2026
7 checks passed
@JoshC8C7 JoshC8C7 deleted the josh/responses-content-part-translation branch June 22, 2026 12:33

3 participants