fix(responses): translate content parts in multi-step executor input #1161
Multi-step /v1/responses requests fail with an opaque executor_error
(stored as 500) whenever a message carries array-form content parts such as
{"type":"input_text","text":"..."}, which is what the OpenAI Responses
SDKs emit.
When tools are present, the executor decomposes the request into loopback
/v1/chat/completions calls. translate_input_items copied each message's
content verbatim, leaving Responses-only part types (input_text, ...) in
place. The loopback handler deserializes into ChatCompletionRequest, whose
ContentPart enum only accepts text and image_url, so the typed Json extractor
rejected the body with a 422 (empty body) before reaching the provider.
Rewrite content parts into chat-completions shape (input_text/output_text to
text, input_image to image_url, refusal to text), mirroring onwards'
convert_message_content. String content and already-chat-shaped parts pass
through; input_file and unknown parts are dropped. Adds unit tests including
the production repro.
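The mapping described above can be sketched with typed enums. This is an illustration only: the PR operates on JSON values, and `ResponsesPart`, `ChatPart`, and `translate_part` are hypothetical names that do not appear in the codebase.

```rust
/// Hypothetical typed model of a Responses content part (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum ResponsesPart {
    InputText(String),
    OutputText(String),
    InputImage { url: String, detail: Option<String> },
    Refusal(String),
    InputFile,
    Unknown(String),
}

/// Hypothetical typed model of a chat-completions content part.
#[derive(Debug, Clone, PartialEq)]
enum ChatPart {
    Text(String),
    ImageUrl { url: String, detail: Option<String> },
}

/// Maps one Responses part to chat-completions shape, or returns `None`
/// when the part has no chat-completions representation and is dropped.
fn translate_part(part: &ResponsesPart) -> Option<ChatPart> {
    match part {
        // input_text, output_text and refusal all become plain text.
        ResponsesPart::InputText(t)
        | ResponsesPart::OutputText(t)
        | ResponsesPart::Refusal(t) => Some(ChatPart::Text(t.clone())),
        // The bare URL is wrapped in an image_url object with optional detail.
        ResponsesPart::InputImage { url, detail } => Some(ChatPart::ImageUrl {
            url: url.clone(),
            detail: detail.clone(),
        }),
        // input_file and unknown types are dropped.
        ResponsesPart::InputFile | ResponsesPart::Unknown(_) => None,
    }
}
```

Returning `Option` keeps the drop decision in one place, so a caller can filter dropped parts without a second pass.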
Deploying control-layer with
| Latest commit: | 682f01d |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://da4a54eb.control-layer.pages.dev |
| Branch Preview URL: | https://josh-responses-content-part.control-layer.pages.dev |
Pull request overview
Fixes multi-step /ai/v1/responses execution failures when message content is provided as an array of typed Responses content parts (e.g. {"type":"input_text","text":"..."}), by translating those parts into chat-completions-compatible ContentPart shapes before issuing the loopback /ai/v1/chat/completions requests.
Changes:
- Update the Responses input → chat-completions messages translator to rewrite message.content arrays via a new translate_message_content helper.
- Add translate_content_part mapping for common Responses part types (input_text, output_text, input_image, refusal) and drop unsupported/unknown parts with logging.
- Add unit tests covering array-form content translation, passthrough behavior for string content, and drop/collapse behavior for unsupported parts.
| /// Returns `None` for parts with no chat-completions representation | ||
| /// (`input_file`, unknown types); these are dropped with a trace rather than | ||
| /// forwarded, since the upstream schema would reject them. Already | ||
| /// chat-shaped parts (`text`, `image_url`) pass through unchanged so a client | ||
| /// that sent chat-completions content directly still works. |
Summary
This PR fixes a bug where multi-step /v1/responses requests with array-form content parts (e.g., {"type":"input_text","text":"..."}) would fail with a 422 error when tools were present. The root cause was that translate_input_items was copying content verbatim, leaving Responses-only part types (input_text, input_image, etc.) in place, which the loopback /v1/chat/completions handler's typed deserialization rejected.
The fix adds translate_message_content and translate_content_part functions to rewrite Responses content parts into Chat Completions shape, mirroring onwards' convert_message_content. The implementation is well-tested with 6 new unit tests covering the main scenarios.
Verdict: Approve with one non-blocking finding regarding input_image with file_id.
Research notes
- OpenAI Images and Vision documentation (https://platform.openai.com/docs/guides/images-vision): Confirmed the content part schemas:
  - Chat Completions: {"type": "image_url", "image_url": {"url": "..."}}
  - Responses API: {"type": "input_image", "image_url": "..."} (bare string, not nested)

  The docs also show that input_image can alternatively use file_id instead of image_url, which is a gap in the current implementation.
- walker.rs (dwctl/src/image_normalizer/walker.rs): Confirmed the two different shapes are already known in the codebase:
  - Lines 6-7: chat-completions shape with nested image_url.url
  - Lines 8-10: responses shape with bare image_url string
Suggested next steps
- Address the non-blocking finding about input_image with file_id: consider supporting this alternative format or documenting the limitation.
- Consider adding a test case for input_image with file_id to explicitly document the current behavior (dropping the part).
General findings
None - all other aspects of the implementation are correct and well-tested.
| Some("input_image") => { | ||
| // Responses carries the image as a bare `image_url` string; | ||
| // chat-completions wraps it in an object with optional detail. | ||
| let url = part.get("image_url").and_then(|u| u.as_str())?; |
Non-blocking: The input_image translation only handles the image_url field, but according to OpenAI's Responses API documentation, input_image can also be expressed with file_id instead:
{"type": "input_image", "file_id": "file-abc123"}
When image_url is missing (e.g., when file_id is used), the ? operator returns None, silently dropping the content part. This may be acceptable for now if file_id-based images are rare in your use cases, but it's worth noting the gap.
Why it matters: Users sending input_image with file_id (uploaded via the Files API for vision) will have those image parts silently dropped, potentially breaking image-based workflows that rely on file uploads rather than URLs.
Suggested fix: Either:
- Add support for file_id by translating it to a chat-completions equivalent such as {"type": "image_url", "image_url": {"file_id": "..."}}, after verifying that the upstream schema accepts it (the code comment describes image_url as an object holding a URL with optional detail), or
- Add explicit logging when dropping input_image with file_id to make the limitation observable, or
- Document this limitation in the module-level docs
For reference, the OpenAI docs show both forms are valid:
// URL form
{"type": "input_image", "image_url": "https://..."}
// File ID form
{"type": "input_image", "file_id": "file-abc123"}
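One way to make the drop observable, as the second option suggests, is to return an explicit outcome instead of a bare `None`. The sketch below is hypothetical: `ImageOutcome` and `translate_input_image` are illustrative names, and it takes already-extracted fields rather than a JSON value. It deliberately does not translate file_id, since whether chat-completions accepts such a form is unverified here.

```rust
/// Hypothetical outcome of translating one `input_image` part.
#[derive(Debug, PartialEq)]
enum ImageOutcome {
    /// Chat-completions `image_url` object: URL plus optional detail.
    Translated { url: String, detail: Option<String> },
    /// The part referenced an uploaded file; the id is kept for logging.
    DroppedFileId(String),
    /// Neither `image_url` nor `file_id` was present.
    DroppedEmpty,
}

/// Classifies an `input_image` part from its (already extracted) fields.
/// A URL wins when both fields are present (an assumption of this sketch).
fn translate_input_image(
    image_url: Option<&str>,
    file_id: Option<&str>,
    detail: Option<&str>,
) -> ImageOutcome {
    match (image_url, file_id) {
        (Some(url), _) => ImageOutcome::Translated {
            url: url.to_string(),
            detail: detail.map(str::to_string),
        },
        (None, Some(id)) => ImageOutcome::DroppedFileId(id.to_string()),
        (None, None) => ImageOutcome::DroppedEmpty,
    }
}
```

A caller could log the `DroppedFileId` case at a higher level than the generic drop trace, so file-based image requests that lose their image are easy to find.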
Summary
This PR fixes a bug where multi-step /v1/responses requests fail with a 500 error when messages contain array-form content parts (e.g., {"type":"input_text","text":"..."}). The fix translates Open Responses content part types (input_text, output_text, input_image, refusal) into their chat-completions equivalents (text, image_url) before the request reaches the loopback /v1/chat/completions handler.
Verdict: This is a well-implemented fix that addresses a real production issue. The code is thorough, well-tested, and follows existing patterns in the codebase. However, there is one potential edge case around annotation preservation that should be considered.
Research notes
I reviewed:
- The diff showing changes to dwctl/src/responses/transition.rs
- Related code in dwctl/src/image_normalizer/walker.rs which also handles Responses API content parts
- The middleware and processor flow that uses this translation
- Test coverage for the new functionality
The fix correctly mirrors the pattern used elsewhere in the codebase (e.g., walker.rs line 134) for detecting Responses API content part types. The comment at line 298 references onwards' typed convert_message_content (onwards/src/strict/adapter.rs) but since onwards is an external crate (version 0.33.1), I couldn't verify the exact implementation - however, the JSON-level approach taken here is appropriate for the translator layer.
Suggested next steps
- Non-blocking: Consider whether annotations on output_text parts should be preserved or dropped (see inline comment)
- Approve once any feedback is addressed
General findings
The implementation is solid:
- String content passes through unchanged (correct - both schemas accept this)
- Array content is translated part-by-part
- Unrepresentable parts (input_file, unknown types) are dropped with appropriate logging
- All-unrepresentable parts collapse to empty string rather than invalid empty array
- Tests cover the production repro case and edge cases
One observation: the refusal part type conversion may need verification against actual upstream behavior - mapping refusal text to regular text content loses the semantic distinction, but this appears intentional given the comment about dropping unrepresentable parts.
General findings (auto-demoted from inline due to pre-validation)
- Nit dwctl/src/responses/transition.rs:332 - Consistent with walker.rs line 138 which also accesses image_url as a bare string for Responses API shape. Good consistency across the codebase.
  - (demoted: code self-check failed at dwctl/src/responses/transition.rs:332: diff has `Some("input_image") => {`, model claimed `let url = part.get("image_url").and_then(|u| u.as_str())?;`)
- Nit dwctl/src/responses/transition.rs:310 - Good edge case handling. Collapsing to empty string rather than empty array prevents a fresh validation error upstream. This matches the adapter behavior mentioned in the comment.
  - (demoted: code self-check failed at dwctl/src/responses/transition.rs:310: diff has `// rather than an empty array, matching the adapter's behavior and keeping`, model claimed `// A message whose parts all dropped collapses to empty-string content`)
| /// that sent chat-completions content directly still works. | ||
| fn translate_content_part(part: &Value) -> Option<Value> { | ||
| match part.get("type").and_then(|t| t.as_str()) { | ||
| Some("input_text") | Some("output_text") => { |
Non-blocking: Consider annotation preservation for output_text parts.
Why it matters: The test at line 840 shows an output_text part with "annotations":[] being passed. The current implementation extracts only the text field and drops any annotations array. While most upstreams may not use annotations, the OpenAI Responses API spec includes annotations for things like citations, and dropping them could lose metadata that clients expect to be preserved round-trip.
Suggested fix: If annotation preservation is desired, you could add the annotations field to the translated part:
Some("input_text") | Some("output_text") => {
    let text = part.get("text").and_then(|t| t.as_str()).unwrap_or_default();
    let mut result = serde_json::Map::new();
    result.insert("type".to_string(), json!("text"));
    result.insert("text".to_string(), json!(text));
    // Optionally preserve annotations if present
    if let Some(annotations) = part.get("annotations") {
        result.insert("annotations".to_string(), annotations.clone());
    }
    // Return the map built above so the annotations are not discarded
    Some(Value::Object(result))
}
However, note that chat-completions ContentPart may not accept annotations either - verify what the upstream schema expects before adding this.
| } | ||
| Some(json!({"type": "image_url", "image_url": Value::Object(image_url)})) | ||
| } | ||
| Some("refusal") => { |
Non-blocking: Verify refusal → text mapping is acceptable.
Why it matters: The refusal part type carries model refusal messages (typically when the model declines to answer due to safety policies). Mapping it to plain text loses the semantic distinction between normal response content and a refusal. Clients that inspect the response structure might expect to handle refusals differently.
That said, this appears intentional - the comment says parts with "no chat-completions representation" are dropped or mapped. Since chat-completions doesn't have a refusal part type, mapping to text is pragmatic. The alternative would be to drop it entirely (like input_file), but that would lose the refusal message content entirely, which is worse.
Suggested fix: No change needed if this behavior is intentional. Consider adding a comment noting that refusal semantics are lost in the translation if this is a known limitation.
| let Value::Array(parts) = content else { | ||
| return content.clone(); | ||
| }; | ||
| let translated: Vec<Value> = parts.iter().filter_map(translate_content_part).collect(); |
Nit: Clean use of filter_map to both translate and drop unrepresentable parts in one pass. This is idiomatic Rust and matches the pattern described in the doc comment.
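The translate-and-drop pass, together with the collapse to empty-string content, can be sketched generically. `Content` and `translate_content` are hypothetical names for illustration; the PR works on JSON values, and the part translator is passed in as a closure here so the sketch stays independent of any particular part type.

```rust
/// Hypothetical model of message content: a plain string or a list of parts.
#[derive(Debug, PartialEq)]
enum Content<P> {
    Text(String),
    Parts(Vec<P>),
}

/// Translates content part by part. String content passes through; parts
/// for which `translate` returns `None` are dropped; if nothing survives,
/// the content collapses to an empty string rather than an empty array.
fn translate_content<P, Q>(
    content: &Content<P>,
    translate: impl Fn(&P) -> Option<Q>,
) -> Content<Q> {
    match content {
        Content::Text(s) => Content::Text(s.clone()),
        Content::Parts(parts) => {
            // One pass: translate representable parts, skip the rest.
            let translated: Vec<Q> = parts.iter().filter_map(|p| translate(p)).collect();
            if translated.is_empty() {
                Content::Text(String::new())
            } else {
                Content::Parts(translated)
            }
        }
    }
}
```

Note that in this sketch an input that is already an empty array also collapses to an empty string, which may or may not match the PR's behavior for that case.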
| // part types in place, which the loopback | ||
| // /v1/chat/completions handler's typed deserialization | ||
| // rejected with a 422. | ||
| "content" => { |
Nit: Correct placement of the content translation hook. By handling this in the key-match within the message object iteration, you ensure all messages get their content translated regardless of role.
Summary
Multi-step /ai/v1/responses requests fail with an opaque executor_error (stored as a 500) whenever a message carries array-form content parts, for example {"type":"input_text","text":"..."}, which is exactly what the OpenAI Responses SDKs emit. This fixes the executor's input translator to rewrite content parts into chat-completions shape.
Root cause
When a /responses request includes tools, the multi-step executor runs. Each model-call step is fired as a loopback HTTP request at the gateway's own /ai/v1/chat/completions endpoint, which re-enters onwards' strict handler. responses/transition.rs::translate_input_items built that body but, for a message item, copied every field except the top-level type discriminator and left the nested content array untouched. So a Responses message:
{ "role": "user", "content": [ { "type": "input_text", "text": "hi" } ] }
was forwarded with input_text intact. The loopback handler deserializes with Json<ChatCompletionRequest>, whose ContentPart enum only accepts text and image_url, so the typed extractor rejected the request with a 422 (empty body) before any provider call.
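The rejection can be illustrated with a closed set of part types. This is a simplified stand-in, not the real handler: the loopback endpoint uses typed deserialization, while the hypothetical `ChatPartKind`, `parse_chat_part_kind`, and `validate_message` below only model the discriminator check.

```rust
/// Hypothetical stand-in for the discriminator of the chat-completions
/// `ContentPart` enum, which only knows `text` and `image_url`.
#[derive(Debug, PartialEq)]
enum ChatPartKind {
    Text,
    ImageUrl,
}

/// Accepts only the two chat-completions part types.
fn parse_chat_part_kind(tag: &str) -> Result<ChatPartKind, String> {
    match tag {
        "text" => Ok(ChatPartKind::Text),
        "image_url" => Ok(ChatPartKind::ImageUrl),
        other => Err(format!("unknown content part type `{other}`")),
    }
}

/// Validates every part tag of a message. The first unknown tag rejects
/// the whole body, which is how one `input_text` part fails the request.
fn validate_message(tags: &[&str]) -> Result<Vec<ChatPartKind>, String> {
    tags.iter().map(|t| parse_chat_part_kind(t)).collect()
}
```

Under this model a message with tags `["input_text"]` is rejected outright, while the same message translated to `["text"]` is accepted.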
Two things made this specific:
- String content ("content": "hi") is valid in both schemas and round-tripped fine. Only array-form typed parts broke.
- Only the decomposed loopback /chat/completions call, taken once tools force the executor path, rejects it.
onwards already has the correct mapping in strict/adapter.rs (convert_message_content); the executor reinvented a partial translator that skipped it. This fix mirrors that mapping at the JSON layer the executor operates on.
Change
dwctl/src/responses/transition.rs:
- message branch now routes content through a new translate_message_content helper instead of copying it verbatim.
- translate_message_content: string content passes through; an array is rewritten part-by-part; a message whose parts all drop collapses to empty-string content.
- translate_content_part: maps input_text/output_text to text, input_image to image_url (with optional detail), refusal to text. Already chat-shaped text/image_url pass through. input_file and unknown types are dropped with a trace.
Tests
Added unit tests in transition.rs (all via parse_parent_request):
- translates_input_text_content_parts_to_chat_text (production repro)
- translates_output_text_content_parts_to_chat_text
- passes_string_content_through_unchanged
- translates_input_image_content_part_to_image_url
- mixed_content_parts_keep_representable_drop_rest
- all_unrepresentable_content_parts_collapse_to_empty_string

Run with just test rust.
Scope
Behavior change is limited to array-form message content on the executor
path. String content and the direct passthrough path are untouched.
🤖 Generated with Claude Code