feat(rerank): add configurable HTTP timeout for OpenAI-compatible client #2784
Merged
qin-ctx merged 2 commits on Jun 23, 2026
OpenAIRerankClient hardcoded a 30s HTTP timeout, which is insufficient for local LLM servers (e.g. llama.cpp on ROCm) that incur model cold-start latency on the first request after inactivity, causing ReadTimeout errors. Add a `timeout` field to RerankConfig (default 30.0, backwards-compatible) and thread it through OpenAIRerankClient.__init__, from_config, and the requests.post call in rerank_batch. The timeout can now be set per-environment in ov.conf, e.g. "timeout": 120. Closes volcengine#2732
qin-ctx approved these changes on Jun 23, 2026
Problem
`OpenAIRerankClient` hardcodes a 30-second HTTP timeout. When using local LLM servers (e.g. llama.cpp on ROCm) that require cold-start model loading, 30s is often insufficient, causing `ReadTimeout` errors on the first request after inactivity.
Solution
Add a configurable `timeout` field to `RerankConfig` (default `30.0`, fully backwards-compatible) and thread it through `OpenAIRerankClient.__init__`, `from_config`, and the `requests.post` call in `rerank_batch`. The timeout can now be set per-environment in `ov.conf`, e.g. `"timeout": 120`.
Subsequent requests still benefit from the already-warm model; only the cold-start first call needs the longer budget.
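A minimal sketch of the threading described above, not the code from this PR. The class names `SketchRerankConfig` and `SketchRerankClient`, the `api_base` and `model` fields, the payload shape, and the `http_post` injection point are all assumptions made so the example runs without a network; only the `timeout` field, its `30.0` default, and the path from config to `from_config` to the POST call come from the description.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class SketchRerankConfig:
    """Hypothetical stand-in for RerankConfig; only `timeout` is from the PR."""
    api_base: str = "http://localhost:8080/v1/rerank"  # hypothetical field
    model: str = "example-reranker"                     # hypothetical field
    timeout: float = 30.0  # same value that was previously hardcoded


class SketchRerankClient:
    """Hypothetical stand-in for OpenAIRerankClient."""

    def __init__(self, api_base: str, model: str, timeout: float = 30.0,
                 http_post: Optional[Callable[..., Any]] = None) -> None:
        if http_post is None:
            # Fall back to requests.post only when no substitute is injected.
            import requests
            http_post = requests.post
        self.api_base = api_base
        self.model = model
        self.timeout = timeout
        self._post = http_post

    @classmethod
    def from_config(cls, config: SketchRerankConfig,
                    http_post: Optional[Callable[..., Any]] = None
                    ) -> "SketchRerankClient":
        # Thread the configured timeout into the constructor.
        return cls(config.api_base, config.model,
                   timeout=config.timeout, http_post=http_post)

    def rerank_batch(self, query: str, documents: List[str]) -> Any:
        # Payload shape is illustrative, not the real request body.
        payload = {"model": self.model, "query": query, "documents": documents}
        # The stored timeout replaces the literal 30 in the POST call.
        return self._post(self.api_base, json=payload, timeout=self.timeout)


calls = []


def fake_post(url, json=None, timeout=None):
    """Record the timeout instead of sending a request."""
    calls.append({"url": url, "timeout": timeout})
    return {"results": []}


client = SketchRerankClient.from_config(
    SketchRerankConfig(timeout=120.0), http_post=fake_post)
client.rerank_batch("query", ["doc a", "doc b"])
print(calls[0]["timeout"])  # prints 120.0
```

Because the default stays at `30.0` in both the config and the constructor, callers that never set `timeout` keep the old behavior. With `requests`, a single float passed as `timeout` applies to both the connect and the read phase.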
Why a config field (not a hardcoded higher value or retries)
Tests
Added `tests/unit/models/rerank/test_openai_rerank_timeout.py` (7 tests): default timeout, custom timeout, config default, `from_config` threading (custom + default), and `rerank_batch` passing the configured/default timeout to `requests.post`. New tests plus the existing rerank suite pass (14 passed).
Closes #2732
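The "timeout reaches the POST call" checks listed under Tests could look roughly like the sketch below. It is not the contents of `test_openai_rerank_timeout.py`: `TinyClient`, its URL, and the helper `timeout_seen_by_post` are hypothetical, and a `MagicMock` stands in for `requests.post` so nothing is sent over the network.

```python
from unittest.mock import MagicMock


class TinyClient:
    """Hypothetical minimal client that forwards its timeout to a POST callable."""

    def __init__(self, post, timeout=30.0):
        self._post = post
        self.timeout = timeout

    def rerank_batch(self, query, documents):
        return self._post(
            "http://localhost/rerank",  # hypothetical URL
            json={"query": query, "documents": documents},
            timeout=self.timeout,
        )


def timeout_seen_by_post(client_timeout=None):
    """Run one rerank_batch call and return the timeout kwarg the mock received."""
    post = MagicMock()
    if client_timeout is None:
        client = TinyClient(post)
    else:
        client = TinyClient(post, timeout=client_timeout)
    client.rerank_batch("q", ["d"])
    return post.call_args.kwargs["timeout"]


def test_default_timeout_is_passed():
    assert timeout_seen_by_post() == 30.0


def test_custom_timeout_is_passed():
    assert timeout_seen_by_post(120.0) == 120.0
```

Asserting on the keyword argument the mock received checks the actual hand-off to the HTTP layer, which is where a hardcoded value would otherwise go unnoticed.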