fix: prevent compounding retries between SDK and framework retry layers #5335
Open
Shubhrakanti wants to merge 13 commits into main from
Conversation
The framework's `_main_task()` in `llm.py` retries up to `max_retry` times (default 3) via `APIConnectOptions`. When the underlying OpenAI SDK client also has its own retry logic enabled, each framework-level retry triggers multiple silent SDK-level retries, causing a multiplicative blowup (e.g. 3 × 3 = 9 actual HTTP requests instead of the expected 3), with only the framework retries visible in the logs.

- `inference/llm.py`: Add `max_retries=0` to the OpenAI client constructor. This was the only client in the codebase missing this setting, causing it to use the SDK default of 2 silent retries.
- `openai/llm.py`: Always set `max_retries=0` on the SDK client regardless of user input, and emit a `DeprecationWarning` when a non-zero value is passed. Users should configure retries exclusively through `APIConnectOptions.max_retry`.

Made-with: Cursor
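The compounding can be reproduced with a small stand-alone sketch (the functions below are hypothetical stand-ins, not the actual framework or SDK code). Counting the initial attempt of each layer, 1 + 3 framework attempts times 1 + 2 SDK attempts yields 12 real HTTP requests for what looks like 4 logged attempts:

```python
# Sketch: two stacked retry layers multiply the number of real HTTP requests.
requests_made = 0


def sdk_call(max_retries: int) -> None:
    """Stand-in for the OpenAI SDK: silently retries internally."""
    global requests_made
    for _ in range(1 + max_retries):
        requests_made += 1  # every attempt "times out" in this sketch
    raise TimeoutError("request timed out")


def framework_call(max_retry: int, sdk_max_retries: int) -> None:
    """Stand-in for LLMStream._main_task(): retries the SDK call."""
    for _ in range(1 + max_retry):
        try:
            sdk_call(sdk_max_retries)
            return
        except TimeoutError:
            continue  # only this layer's retries show up in logs


framework_call(max_retry=3, sdk_max_retries=2)
print(requests_made)  # 4 framework attempts x 3 SDK attempts = 12
```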
theomonnom
reviewed
Apr 3, 2026
    metadata: NotGivenOr[dict[str, str]] = NOT_GIVEN,
    max_completion_tokens: NotGivenOr[int] = NOT_GIVEN,
    timeout: httpx.Timeout | None = None,
    max_retries: NotGivenOr[int] = NOT_GIVEN,
Member
By default it was 0? Are you sure this is the issue?
theomonnom
reviewed
Apr 3, 2026
    )
    from .utils import AsyncAzureADTokenProvider

    logger = logging.getLogger(__name__)
Member
Suggested change
    logger = logging.getLogger(__name__)
theomonnom
reviewed
Apr 3, 2026
    )

    if is_given(max_retries) and max_retries > 0:
        warnings.warn(
Member
We deprecate using the logger in other parts of the code:
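A sketch of that suggestion (the helper name is hypothetical; the real change would live in `openai/llm.py`), routing the deprecation notice through the module logger instead of `warnings.warn`:

```python
import logging

logger = logging.getLogger(__name__)


def warn_deprecated_max_retries(max_retries: int) -> None:
    """Hypothetical helper: log (rather than warnings.warn) that the
    SDK-level max_retries argument is deprecated and ignored."""
    if max_retries > 0:
        logger.warning(
            "max_retries is deprecated and ignored; configure retries via "
            "APIConnectOptions.max_retry instead"
        )
```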
Summary
Fixes a bug where two independent retry layers compound multiplicatively, causing far more HTTP requests than expected with most retries invisible in logs.
The problem
The framework's `LLMStream._main_task()` in `llm.py` implements retry logic controlled by `APIConnectOptions` (default: `max_retry=3`, `timeout=10s`). Separately, the OpenAI SDK client has its own built-in retry mechanism (`max_retries`, SDK default: 2). When both layers are active, each framework-level attempt calls the `_run()` method, which makes an API call through the OpenAI SDK. If that call times out, the SDK silently retries internally before surfacing the error back to the framework, which then retries again. This produces multiplicative behavior (SDK attempts nested inside each framework attempt in `_main_task`).

A user expecting 3 retries over ~30s instead sees 3 logged retries, each taking ~30s (3 × 10s timeouts silently inside the SDK), for a total wait of ~120s with 12 actual HTTP requests.
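The worst-case arithmetic works out as follows (a sketch using the default values quoted in this PR, assuming every attempt times out):

```python
# Each layer makes 1 initial attempt plus its configured retries.
framework_attempts = 1 + 3  # initial call + APIConnectOptions.max_retry
sdk_attempts = 1 + 2        # initial call + OpenAI SDK default max_retries
timeout_s = 10              # per-request timeout from APIConnectOptions

total_requests = framework_attempts * sdk_attempts  # 12 HTTP requests
total_wait_s = total_requests * timeout_s           # ~120 s worst case

print(total_requests, total_wait_s)  # 12 120
```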
The fix
- `inference/llm.py`: Add `max_retries=0` to the `openai.AsyncClient` constructor. This was the only client in the codebase that did not set this, causing it to silently use the OpenAI SDK's default of 2 retries. Every other client (`openai/llm.py`, `openai/stt.py`, `openai/tts.py`, `openai/responses/llm.py`) already had `max_retries=0`.
- `openai/llm.py`: Always set `max_retries=0` on the SDK client regardless of user input. Previously, passing `max_retries=N` would enable SDK-level retries that compound with the framework's retries. Now emits a `DeprecationWarning` when a non-zero value is passed, guiding users to configure retries exclusively through `APIConnectOptions.max_retry`.

How to configure retries (before and after this fix)
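A minimal sketch of the `openai/llm.py` behavior described above (the `NOT_GIVEN` sentinel, `is_given` helper, and function name are simplified stand-ins for the real ones):

```python
import warnings

NOT_GIVEN = object()  # simplified stand-in for the SDK's NotGiven sentinel


def is_given(value) -> bool:
    return value is not NOT_GIVEN


def resolve_sdk_max_retries(max_retries=NOT_GIVEN) -> int:
    """Pin the SDK client to 0 retries; warn if the caller asked for more."""
    if is_given(max_retries) and max_retries > 0:
        warnings.warn(
            "max_retries on the SDK client is deprecated and ignored; "
            "configure retries via APIConnectOptions.max_retry",
            DeprecationWarning,
            stacklevel=2,
        )
    return 0  # always passed to the OpenAI client as max_retries=0
```

With this, the framework stays the single owner of retry behavior, and every HTTP attempt shows up in its logs.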
Test plan

- `inference.LLM` no longer triggers SDK-level retries (only framework retries visible in logs)
- `openai.LLM(max_retries=3)` emits a `DeprecationWarning` and does not pass the value to the SDK client
- Default usage (no `max_retries` arg) continues to work with `max_retries=0` on the SDK client
- Lint passes (`make lint` ✅)

Made with Cursor