feat: AE-1225: poll request logs for qb workers #276
Conversation
runpod-Henrik
left a comment
1. Bug: start_time default is evaluated at import time, not instantiation

```python
# request_logs.py:35
def __init__(
    self,
    ...
    start_time: datetime = datetime.now(timezone.utc),  # evaluated ONCE at module load
):
```

`datetime.now()` in a default argument is computed when the module is first imported, not when `QBRequestLogFetcher()` is called. Every instance created without an explicit `start_time` (including the one created in `run()`) gets the same stale timestamp. On a long-running process that calls `run()` hours after startup, the first poll will request logs from the entire lifetime of the process.
Fix:

```python
start_time: Optional[datetime] = None
# in body:
self.start_time = start_time if start_time is not None else datetime.now(timezone.utc)
```

2. Issue: print() for log output bypasses the logging system
```python
# serverless.py:240
print(f"worker log: {line}")
```

Flash uses `log = logging.getLogger(__name__)` throughout. `print()` bypasses LOG_LEVEL filtering and SensitiveDataFilter. If a worker log line contains an API key or other sensitive value, it will not be filtered.

Should be `log.info("worker log: %s", line)` (or `rich.print` if the intent is always-visible user output, but then it should be consistent with how the rest of the CLI surfaces output to users).
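To illustrate the filtering concern, a hedged sketch (`RedactingFilter` is a hypothetical stand-in for flash's SensitiveDataFilter, not its real implementation): a filter attached to the logging pipeline can redact sensitive lines, while `print()` writes to stdout without ever passing through it.

```python
import logging

# Hypothetical stand-in for SensitiveDataFilter: blanks any record
# whose formatted message contains an API-key marker.
class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        if "api_key" in record.getMessage():
            record.msg = "[REDACTED]"
            record.args = ()
        return True

log = logging.getLogger("worker")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("worker log: %s", "api_key=secret123")  # filter rewrites this to [REDACTED]
print("worker log: api_key=secret123")           # bypasses handlers and filters entirely
```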
3. Issue: fetched_until stalls when logs have no timestamps
```python
# request_logs.py:165
else:
    # not all logs have a timestamp, assume we should refetch
    self.fetched_until = self.start_time
```

When records have no `dt` field, `fetched_until` is set to `start_time` (unchanged). The next call to `fetch_logs()` then sets `self.start_time = self.fetched_until = self.start_time`, so the window never advances. The deduplication `seen` set prevents duplicate output, but the API is called with the same time range on every poll for the entire duration of the job.

Should fall back to `end_utc` (the timestamp of the current fetch) rather than `start_time`.
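The fix can be sketched with a toy polling window (the field names `dt` and `end_utc` come from the review; the class itself is illustrative, not flash's):

```python
from datetime import datetime, timedelta, timezone

# Toy polling window: `advance` must move `fetched_until` forward even
# when no returned record carries a timestamp, otherwise every poll
# re-requests the same time range.
class Window:
    def __init__(self, start_time: datetime) -> None:
        self.start_time = start_time
        self.fetched_until = start_time

    def advance(self, records: list[dict], end_utc: datetime) -> None:
        stamps = [r["dt"] for r in records if r.get("dt") is not None]
        if stamps:
            self.fetched_until = max(stamps)
        else:
            # Fall back to the fetch time, not start_time, so the
            # window keeps moving when logs lack timestamps.
            self.fetched_until = end_utc
        self.start_time = self.fetched_until

t0 = datetime(2026, 3, 19, tzinfo=timezone.utc)
w = Window(t0)
w.advance([{"dt": None}], end_utc=t0 + timedelta(seconds=5))
assert w.start_time == t0 + timedelta(seconds=5)  # window advanced despite missing timestamps
```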
4. Issue: _fetch_worker_id is dead code
```python
# request_logs.py:67
async def _fetch_worker_id(self, endpoint_id, request_id, runpod_api_key):
    ...
```

This method is never called anywhere in the PR. `fetch_logs()` doesn't pass a `request_id`, `matched_by_request_id` is always False, and `worker_id` is always None. If this is scaffolding for future work, a comment would help. If it's not needed, removing it keeps the surface area clean.
5. Question: stdout deduplication path has no test coverage
The tests for `run()` patch out `_emit_endpoint_logs` entirely with AsyncMock, so `fetcher.seen` is never populated. The stdout deduplication block:

```python
# serverless.py:1157-1166
if raw in fetcher.seen:
    continue
```

has no test. A test that lets `_emit_endpoint_logs` populate `fetcher.seen` and then verifies that matching stdout lines are stripped from the final output would close this gap.
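The shape of the missing test might look like this (a standalone sketch; `FakeFetcher` and `dedupe_stdout` are hypothetical stand-ins for the real fetcher and the replay loop in serverless.py):

```python
# Hypothetical sketch of the missing test: seed the fetcher's `seen` set
# as _emit_endpoint_logs would, then check that the replay dedup strips
# already-streamed lines from the final stdout.
class FakeFetcher:
    def __init__(self) -> None:
        self.seen: set[str] = set()

def dedupe_stdout(stdout_lines: list[str], fetcher: FakeFetcher) -> list[str]:
    # Mirrors the `if raw in fetcher.seen: continue` block under test.
    return [raw for raw in stdout_lines if raw not in fetcher.seen]

def test_stdout_dedup_strips_streamed_lines() -> None:
    fetcher = FakeFetcher()
    fetcher.seen.update({"jello: 0", "jello: 1"})  # as if streamed earlier
    out = dedupe_stdout(["jello: 0", "jello: 1", "jello: 2"], fetcher)
    assert out == ["jello: 2"]  # only the unstreamed line survives

test_stdout_dedup_strips_streamed_lines()
```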
6. Minor: _emit_endpoint_logs has inconsistent return
```python
if not batch:
    return False  # explicit False
# other paths:
return  # implicit None
```

The return value is unused by the caller, so there is no functional impact, but the mixed False/None is confusing. Either make it `-> None` throughout (remove the `return False`) or commit to returning a bool.
Verdict
The mutable default argument is a real bug that will cause missed logs on any process that imports flash before calling run(). The other three issues (print vs logger, fetched_until stall, dead method) are worth fixing before merge. Tests are solid for the happy path — the one gap is the stdout deduplication path.
I'm seeing double logs
@KAJdev yeah, the double logs thing happens in main. At the end of the request, flash by default replays stdout from the completed job request into your terminal. Currently the logging handler puts both formatted logs and raw log lines into stdout. For example:

```json
{
  "delayTime": 10342,
  "executionTime": 5078,
  "id": "3655fecf-ee84-45db-832d-6bde2b7031d9-u1",
  "output": {
    "instance_id": null,
    "instance_info": null,
    "json_result": null,
    "result": "gAWVFQAAAAAAAAB9lIwGcmVzdWx0lIwFaG93ZHmUcy4=",
    "stdout": "2026-03-19 13:55:24,435 | INFO | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 0\n2026-03-19 13:55:25,435 | INFO | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 1\n2026-03-19 13:55:26,436 | INFO | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 2\n2026-03-19 13:55:27,436 | INFO | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 3\n2026-03-19 13:55:28,436 | INFO | 3655fecf-ee84-45db-832d-6bde2b7031d9-u1 | jello: 4\n\njello: 0\njello: 1\njello: 2\njello: 3\njello: 4\n",
    "success": true
  },
  "status": "COMPLETED",
  "workerId": "k7zjko9p3g579t"
}
```

I am gonna try and figure this out as a short-term fix, but also in your case you didn't actually get any polled log lines 🫠 the latency between when they're actually available as endpoint logs is pretty high
yeah i even tried 60s requests and would only get about 1 log out of 30 at runtime. I think we probably should think of a different method since I'm not sure this is going to be that useful
# Conflicts:
#   src/runpod_flash/core/resources/serverless.py
Pull request overview
This PR adds real-time log/diagnostic visibility for queue-based (QB) serverless runs, plus improves deployment manifest reconciliation so runtime endpoint metadata (IDs/URLs/aiKey) is preserved more reliably across deploys.
Changes:
- Introduces QB pod log polling during async `ServerlessResource.run(...)` (including phase/status messaging and stdout de-dupe against streamed logs).
- Adds a `WorkerAvailabilityDiagnostic` flow to improve "no workers available" diagnostics (GPU/CPU availability + throttling guidance).
- Updates deployment reconciliation to carry forward endpoint metadata (endpoint_id/url normalization/aiKey) and sanitizes local manifest writes.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/runpod_flash/core/resources/serverless.py | Polls QB request logs during async runs; emits phase/status messages; attempts stdout de-dupe vs streamed logs. |
| src/runpod_flash/core/resources/request_logs.py | Adds QBRequestLogFetcher + models to fetch status/metrics/pod logs and return phased batches. |
| src/runpod_flash/core/resources/worker_availability_diagnostic.py | Adds availability diagnostics via GraphQL (GPU/CPU + throttled handling). |
| src/runpod_flash/core/api/runpod.py | Adds GraphQL helpers to query GPU/CPU stock status. |
| src/runpod_flash/cli/utils/deployment.py | Persists endpoint_id/aiKey into state manifest; normalizes endpoint_url; sanitizes local manifest to avoid writing aiKey. |
| tests/unit/resources/test_serverless.py | Adds tests for QB log polling behaviors + stdout de-dupe + repeated diagnostics. |
| tests/unit/resources/test_request_logs.py | Adds tests for fetcher phases, priming/streaming, and auth fallback behavior. |
| tests/unit/resources/test_worker_availability_diagnostic.py | Adds tests for diagnostics messages/reasons (max=0, gpu/cpu availability, throttled, out-of-stock). |
| tests/unit/cli/utils/test_deployment.py | Adds tests for persisting endpoint_id/aiKey to state manifest and sanitizing local manifest disk writes. |
| src/runpod_flash/cli/docs/flash-logging.md | Documents QB request log polling during async run(...). |
| src/runpod_flash/cli/docs/flash-deploy.md | Documents manifest credential handling and local manifest sanitization. |
| docs/Deployment_Architecture.md | Documents how aiKey is handled between state manifest vs local manifest. |
Summary
some examples of behavior