Deployment Architecture

Overview

This document describes the build and deploy pipeline for Flash applications. It covers what happens when you run flash build and flash deploy, how endpoints are provisioned, and how the manifest ties everything together.

Build Pipeline

flash build

flash build packages your application into a deployable archive:

flash build
    │
    ├── 1. Discovery
    │   ├── Scan .py files for @Endpoint(...) decorators (QB)
    │   ├── Scan for Endpoint(...) variable assignments with routes (LB)
    │   └── Scan for @Endpoint(...) on classes (class-based QB)
    │
    ├── 2. Manifest Generation
    │   ├── Map functions/classes to resource names
    │   ├── Record LB routes (method + path)
    │   ├── Detect cross-endpoint calls (makes_remote_calls flag)
    │   └── Write flash_manifest.json
    │
    ├── 3. Handler Generation
    │   ├── QB functions: generate deployed handler (JSON in/out)
    │   ├── QB classes: generate class handler (singleton instance, method dispatch)
    │   └── LB endpoints: no handler needed (FastAPI server generated by runtime)
    │
    ├── 4. Dependency Installation
    │   ├── Install Python packages for linux/x86_64
    │   ├── Target Python 3.12 for wheel ABI selection
    │   └── Binary wheels only (no compilation)
    │
    └── 5. Packaging
        └── Create .flash/artifact.tar.gz

Discovery

The scanner (cli/commands/build_utils/scanner.py) uses AST analysis to find:

  • @Endpoint(...) on functions: Queue-based endpoints. One function per endpoint.
  • @Endpoint(...) on classes: Class-based queue-based endpoints. The class is instantiated once per worker (singleton), and methods are dispatched per request.
  • Endpoint(...) variable assignments: Load-balanced endpoints. Routes are registered via @ep.post("/path"), @ep.get("/path"), etc.
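The three discovery rules above can be sketched with the standard-library `ast` module. This is a simplified illustration, not the actual scanner in `cli/commands/build_utils/scanner.py`, and the example `Endpoint` usages are assumed:

```python
import ast

# Sample source using the three patterns the scanner looks for (illustrative).
SOURCE = '''
@Endpoint(gpu="A100")
def process(data):
    return data

@Endpoint(gpu="A100")
class MyModel:
    def predict(self, x): ...

api = Endpoint(cpu=True)
'''

def discover(source: str) -> dict:
    """Classify top-level definitions into QB functions, QB classes, and LB endpoints."""
    tree = ast.parse(source)
    found = {"qb_functions": [], "qb_classes": [], "lb_endpoints": []}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            # @Endpoint(...) decorator -> queue-based endpoint
            for dec in node.decorator_list:
                if (isinstance(dec, ast.Call)
                        and isinstance(dec.func, ast.Name)
                        and dec.func.id == "Endpoint"):
                    key = "qb_classes" if isinstance(node, ast.ClassDef) else "qb_functions"
                    found[key].append(node.name)
        elif isinstance(node, ast.Assign):
            # var = Endpoint(...) assignment -> load-balanced endpoint
            value = node.value
            if (isinstance(value, ast.Call)
                    and isinstance(value.func, ast.Name)
                    and value.func.id == "Endpoint"):
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        found["lb_endpoints"].append(target.id)
    return found
```

Running `discover(SOURCE)` classifies `process` as a QB function, `MyModel` as a QB class, and `api` as an LB endpoint.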

Manifest Structure

{
    "functions": [
        {
            "name": "process",
            "module_path": "gpu_worker",
            "resource_name": "gpu-worker",
            "is_load_balanced": false,
            "is_class": false,
            "dependencies": ["torch"],
            "makes_remote_calls": false
        },
        {
            "name": "MyModel",
            "module_path": "model_worker",
            "resource_name": "model-worker",
            "is_load_balanced": false,
            "is_class": true,
            "class_methods": ["predict", "embed"],
            "dependencies": ["torch", "transformers"]
        },
        {
            "name": "api",
            "module_path": "api_server",
            "resource_name": "api-server",
            "is_load_balanced": true,
            "is_class": false,
            "routes": [
                {"method": "POST", "path": "/predict", "handler": "predict"},
                {"method": "GET", "path": "/health", "handler": "health"}
            ]
        }
    ],
    "resources": [
        {"name": "gpu-worker", "is_load_balanced": false},
        {"name": "model-worker", "is_load_balanced": false},
        {"name": "api-server", "is_load_balanced": true}
    ]
}
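Consumers of the manifest can index it however they need. A minimal sketch of loading `flash_manifest.json` and grouping entries (the manifest content below mirrors a subset of the example above; the loading code is illustrative, not part of Flash):

```python
import json

# A trimmed-down manifest matching the documented schema.
manifest = json.loads("""
{
    "functions": [
        {"name": "process", "resource_name": "gpu-worker", "is_load_balanced": false},
        {"name": "api", "resource_name": "api-server", "is_load_balanced": true}
    ],
    "resources": [
        {"name": "gpu-worker", "is_load_balanced": false},
        {"name": "api-server", "is_load_balanced": true}
    ]
}
""")

# Index functions by the resource that serves them.
by_resource = {fn["resource_name"]: fn for fn in manifest["functions"]}
# Collect only the load-balanced resources.
lb_resources = [r["name"] for r in manifest["resources"] if r["is_load_balanced"]]
```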

Handler Generation

The handler generator (cli/commands/build_utils/handler_generator.py) produces different handlers based on endpoint type:

Function QB handler -- wraps a function for Runpod's serverless protocol:

# generated handler for QB function endpoints
from module_path import function_name

def handler(job):
    job_input = job["input"]
    result = function_name(job_input)
    return result

Class QB handler -- instantiates class once, dispatches to methods:

# generated handler for class-based QB endpoints
from module_path import ClassName

_instance = ClassName()
_METHODS = {"predict": _instance.predict, "embed": _instance.embed}

def handler(job):
    job_input = job["input"]
    # single-method classes auto-dispatch
    # multi-method classes require "method" key in input
    method_name = job_input.pop("method", None)
    if method_name is None:
        if len(_METHODS) != 1:
            raise ValueError('multi-method class requires a "method" key in input')
        method_name = next(iter(_METHODS))
    method = _METHODS[method_name]
    return method(**job_input)

LB endpoints do not need generated handlers. The LB runtime image starts a FastAPI server that loads routes from the manifest.

Deploy Pipeline

flash deploy

flash deploy runs the build pipeline, then provisions endpoints:

flash deploy --env production
    │
    ├── 1. Build (same as flash build)
    │
    ├── 2. Upload
    │   └── Upload artifact.tar.gz to Runpod storage (R2)
    │
    ├── 3. Provision Endpoints
    │   ├── For each resource in manifest:
    │   │   ├── Check if endpoint exists (by name in environment)
    │   │   ├── If new: create endpoint via GraphQL API
    │   │   ├── If exists + config drift: update endpoint
    │   │   └── If exists + no drift: skip
    │   └── Set env vars on each endpoint (explicit env={} + system vars like RUNPOD_API_KEY)
    │
    ├── 4. Register with State Manager
    │   └── Store endpoint IDs for cross-endpoint routing
    │
    └── 5. Post-Deploy
        ├── Display endpoint URLs
        └── Show available routes

Resource Class Selection

Endpoint._build_resource_config() selects the appropriate internal resource class for provisioning:

Usage Pattern                  GPU   CPU   Internal Class
@Endpoint(...) on function     yes   --    LiveServerless
@Endpoint(...) on function     --    yes   CpuLiveServerless
@Endpoint(...) on class        yes   --    LiveServerless
ep = Endpoint(...) + routes    yes   --    LiveLoadBalancer
ep = Endpoint(...) + routes    --    yes   CpuLiveLoadBalancer
Endpoint(image=...)            yes   --    ServerlessEndpoint
Endpoint(image=...)            --    yes   CpuServerlessEndpoint
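The selection rules in the table reduce to a small decision function. The sketch below is an assumed simplification; the actual logic lives in `Endpoint._build_resource_config()` and may differ in detail:

```python
# Illustrative mapping from usage pattern to internal resource class name.
def select_resource_class(*, has_routes: bool = False, gpu: bool = True,
                          custom_image: bool = False) -> str:
    if custom_image:
        # Endpoint(image=...) deploys a prebuilt image
        return "ServerlessEndpoint" if gpu else "CpuServerlessEndpoint"
    if has_routes:
        # Endpoint(...) with @ep.post/@ep.get routes is load-balanced
        return "LiveLoadBalancer" if gpu else "CpuLiveLoadBalancer"
    # decorated functions and classes both run as queue-based serverless
    return "LiveServerless" if gpu else "CpuLiveServerless"
```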

Docker Images

Each resource class maps to a specific Docker image:

Internal Class        Image                               Base
LiveServerless        runpod/worker-flash:latest          PyTorch + CUDA
CpuLiveServerless     runpod/worker-flash-cpu:latest      Python slim
LiveLoadBalancer      runpod/worker-flash-lb:latest       PyTorch + FastAPI
CpuLiveLoadBalancer   runpod/worker-flash-lb-cpu:latest   Python slim + FastAPI

Config Drift Detection

When deploying to an environment that already has endpoints, Flash compares the current configuration hash against the stored hash. If they differ, the endpoint is updated. See Resource Config Drift Detection for details.
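A common way to implement this comparison is to hash a canonical serialization of the config. The scheme below is an assumption for illustration; Flash's actual hashing may normalize fields differently:

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Hash a canonical (sorted-key) JSON serialization of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

stored = config_hash({"gpu": "A100", "workers": 2})
current = config_hash({"workers": 2, "gpu": "A100"})  # key order is irrelevant
drifted = config_hash({"gpu": "A100", "workers": 4})  # changed value -> new hash
```

Because keys are sorted before hashing, reordering a config produces no drift, while any value change does.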

Cross-Endpoint Routing at Deploy Time

When flash deploy provisions endpoints:

  1. Endpoints with makes_remote_calls=True get RUNPOD_API_KEY injected automatically
  2. Each endpoint gets the flash_manifest.json included in its artifact
  3. The State Manager stores {environment_id, resource_name} -> endpoint_id
  4. At runtime, the ServiceRegistry uses the manifest + State Manager to route calls

Manifest Credential Handling

  • Runtime endpoint metadata (including API-returned aiKey) may be stored in the State Manager manifest for deployment reconciliation.
  • Local .flash/flash_manifest.json is sanitized before it is written to disk and does not include aiKey.
  • RUNPOD_API_KEY is sourced from environment/credential storage and injected into endpoint env when needed; it is not persisted in the local manifest.
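The sanitization step can be pictured as stripping the sensitive field from a copy before writing to disk. The `aiKey` field name comes from the description above; the sanitizer itself is an illustrative sketch:

```python
import copy

def sanitize(manifest: dict) -> dict:
    """Return a copy of the manifest with aiKey removed from every resource."""
    clean = copy.deepcopy(manifest)
    for resource in clean.get("resources", []):
        resource.pop("aiKey", None)
    return clean

raw = {"resources": [{"name": "api-server", "aiKey": "secret-value"}]}
local = sanitize(raw)  # safe to write to .flash/flash_manifest.json
```

The deep copy matters: the in-memory manifest (which the State Manager may still need, credentials included) is left untouched.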

See Cross-Endpoint Routing for the full runtime flow.

Related Documentation