diff --git a/.cursor/rules/rp-styleguide.mdc b/.cursor/rules/rp-styleguide.mdc
index 60a049f0..8d9db909 100644
--- a/.cursor/rules/rp-styleguide.mdc
+++ b/.cursor/rules/rp-styleguide.mdc
@@ -5,7 +5,7 @@ alwaysApply: true
---
Always use sentence case for headings and titles.
-These are proper nouns: Runpod, Pods, Serverless, Hub, Instant Clusters, Secure Cloud, Community Cloud, Tetra.
+These are proper nouns: Runpod, Pods, Serverless, Hub, Instant Clusters, Secure Cloud, Community Cloud, Flash.
These are generic terms: endpoint, worker, cluster, template, handler, fine-tune, network volume.
Prefer using paragraphs to bullet points unless directly asked.
diff --git a/CLAUDE.md b/CLAUDE.md
index 3e7dae0c..0be9f6d3 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -98,7 +98,7 @@ Follow the Runpod style guide (`.cursor/rules/rp-styleguide.mdc`) and Google Dev
### Capitalization and Terminology
- **Always use sentence case** for headings and titles
-- **Proper nouns**: Runpod, Pods, Serverless, Hub, Instant Clusters, Secure Cloud, Community Cloud, Tetra
+- **Proper nouns**: Runpod, Pods, Serverless, Hub, Instant Clusters, Secure Cloud, Community Cloud, Flash
- **Generic terms** (lowercase): endpoint, worker, cluster, template, handler, fine-tune, network volume
### Writing Style
diff --git a/community-solutions/ohmyrunpod/overview.mdx b/community-solutions/ohmyrunpod/overview.mdx
index 84767879..354dda84 100644
--- a/community-solutions/ohmyrunpod/overview.mdx
+++ b/community-solutions/ohmyrunpod/overview.mdx
@@ -17,10 +17,10 @@ Check the repository for additional features, updates, and documentation: [githu
## Key features
-
+
Simplified file transfer between your local machine and Runpod instances using SFTP or Croc
-
+
Automatically configures SSH access with secure key generation and password management
diff --git a/community-solutions/overview.mdx b/community-solutions/overview.mdx
index 8a07d5e4..688daa02 100644
--- a/community-solutions/overview.mdx
+++ b/community-solutions/overview.mdx
@@ -17,7 +17,7 @@ Community tools and solutions are provided as-is and maintained by their creator
Explore these community-created tools that can enhance your Runpod workflow:
-
-
-
-
+
## 📄️ Overview
Learn how to build and deploy applications on the Runpod platform with this set of tutorials, covering tools, technologies, and deployment methods, including Containers, Docker, and Serverless implementation.
-
+
## 📄️ Intro to containers
Discover the world of containerization with Docker, a platform for isolated environments that package applications, frameworks, and libraries into self-contained containers for consistent and reliable deployment across diverse computing environments.
-
+
## 📄️ Dockerfile
Learn how to create a Dockerfile to customize a Docker image and use an entrypoint script to run a command when the container starts, making it a reusable and executable unit for deploying and sharing applications.
-
+
## 📄️ Persist data outside of containers
Learn how to persist data outside of containers by creating named volumes, mounting volumes to data directories, and accessing persisted data from multiple container runs and removals in Docker.
-
+
## 📄️ Docker commands
Runpod enables BYOC development with Docker, providing a reference sheet for commonly used Docker commands, including login, images, containers, Dockerfile, volumes, network, and execute.
diff --git a/docs.json b/docs.json
index 83121e08..e7ea1bf1 100644
--- a/docs.json
+++ b/docs.json
@@ -9,6 +9,7 @@
},
"background": {},
"styling": {
+ "css": "/style.css",
"codeblocks": {
"theme": {
"dark": "github-dark",
@@ -44,6 +45,55 @@
"get-started/mcp-servers"
]
},
+ {
+ "group": "Flash",
+ "pages": [
+ "flash/overview",
+ "flash/quickstart",
+ "flash/pricing",
+ "flash/create-endpoints",
+ "flash/custom-docker-images",
+ {
+ "group": "Configure resources",
+ "pages": [
+ "flash/configuration/storage",
+ "flash/configuration/gpu-types",
+ "flash/configuration/cpu-types",
+ "flash/configuration/parameters",
+ "flash/configuration/best-practices"
+ ]
+ },
+ "flash/execution-model",
+ "flash/troubleshooting",
+ {
+ "group": "Build apps",
+ "pages": [
+ "flash/apps/overview",
+ "flash/apps/build-app",
+ "flash/apps/initialize-project",
+ "flash/apps/customize-app",
+ "flash/apps/local-testing",
+ "flash/apps/apps-and-environments",
+ "flash/apps/deploy-apps",
+ "flash/apps/requests"
+ ]
+ },
+ {
+ "group": "CLI reference",
+ "pages": [
+ "flash/cli/overview",
+ "flash/cli/init",
+ "flash/cli/login",
+ "flash/cli/run",
+ "flash/cli/build",
+ "flash/cli/deploy",
+ "flash/cli/env",
+ "flash/cli/app",
+ "flash/cli/undeploy"
+ ]
+ }
+ ]
+ },
{
"group": "Serverless",
"pages": [
@@ -168,13 +218,13 @@
"pages": [
"public-endpoints/overview",
"public-endpoints/quickstart",
+ "public-endpoints/reference",
"public-endpoints/requests",
"public-endpoints/ai-sdk",
"public-endpoints/ai-coding-tools",
{
"group": "Models",
"pages": [
- "public-endpoints/reference",
{
"group": "Image models",
"pages": [
@@ -358,6 +408,14 @@
"tutorials/serverless/run-gemma-7b"
]
},
+ {
+ "group": "Flash",
+ "pages": [
+ "tutorials/flash/text-generation-with-transformers",
+ "tutorials/flash/image-generation-with-sdxl",
+ "tutorials/flash/build-rest-api-with-load-balancer"
+ ]
+ },
{
"group": "Pods",
"pages": [
@@ -823,6 +881,10 @@
{
"source": "/hub/public-endpoint-reference",
"destination": "/public-endpoints/reference"
+ },
+ {
+ "source": "/flash/endpoint-functions",
+ "destination": "/flash/create-endpoints"
}
]
}
diff --git a/flash/apps/apps-and-environments.mdx b/flash/apps/apps-and-environments.mdx
new file mode 100644
index 00000000..8367b955
--- /dev/null
+++ b/flash/apps/apps-and-environments.mdx
@@ -0,0 +1,158 @@
+---
+title: "Apps and environments"
+sidebarTitle: "Apps and environments"
+description: "Understanding Flash's two-level deployment structure for organizing projects and managing deployments."
+---
+
+Flash uses a two-level organizational structure for deployments: **apps** and **environments**. Understanding this structure helps you organize projects, manage multiple deployment stages, and isolate resources effectively.
+
+## Flash apps
+
+A **Flash app** is a namespace on Runpod's backend that groups all resources for a single project. The app itself is just metadata—actual cloud resources (endpoints, volumes) are created when you deploy to an environment.
+
+An app consists of:
+
+- **App registry entry**: Metadata in Runpod's system (namespace, project identifier).
+- **Environments**: Different deployment stages (dev, staging, production).
+- **Builds**: Versioned tarball artifacts containing your code and dependencies.
+- **Serverless endpoints**: Running infrastructure created per environment.
+
+Apps are created automatically when you first run `flash deploy`, or explicitly with `flash app create <app-name>`.
+
+### App hierarchy
+
+Each app contains one or more environments, and each environment holds its own deployed endpoints, build version, and volumes.
+
+## Environments
+
+An **environment** is an isolated deployment stage within a Flash app (e.g., `dev`, `staging`, `production`). Each environment has its own endpoints, build version, volumes, and deployment state. Environments are completely independent—deploying to `dev` has no effect on `production`.
+
+An environment contains:
+
+- **Deployed endpoints**: Serverless workers for your `@Endpoint` functions.
+- **Build version**: The specific code version running in this environment.
+- **Volumes**: Persistent storage attached to workers.
+- **State**: Current deployment status (deploying, deployed, failed).
+
+Environments are created automatically when you deploy with `--env <env-name>`, or explicitly with `flash env create <env-name>`.
+
+### Environment states
+
+Environments can be in several states:
+
+| State | Description |
+|-------|-------------|
+| `deploying` | Deployment in progress (building artifacts, provisioning endpoints) |
+| `deployed` | Successfully deployed and running |
+| `failed` | Deployment failed (check logs in the [Runpod console](https://www.runpod.io/console/serverless)) |
+| `updating` | Configuration update in progress |
+
+## Builds and deployments
+
+When you run `flash deploy`, Flash creates and uploads a build artifact, then provisions endpoints:
+
+### Build process
+
+1. **Create tarball**: Flash packages your code into `.flash/artifact.tar.gz` containing:
+ - Worker Python files (`lb_worker.py`, `gpu_worker.py`, `cpu_worker.py`).
+ - Pre-installed dependencies (bundled during build).
+ - Deployment manifest (`flash_manifest.json`).
+ - Auto-generated handler code.
+
+2. **Upload artifact**: The tarball is uploaded to Runpod's storage and associated with your app as a versioned "build".
+
+3. **Provision endpoints**: For each resource configuration, Flash creates a Serverless endpoint that:
+ - Runs on pre-built Flash Docker images (`runpod/flash:latest` or `runpod/flash-cpu:latest`).
+ - Extracts your tarball and executes your code.
+ - Scales automatically based on load.
+
+4. **Activate environment**: The environment is linked to the build and endpoints.
+
+
+You're **not** building custom Docker images. Flash uses pre-built images that extract your tarball and run your code. This is why deployments are fast (no image build step) and limited to 500 MB (code and dependencies only).
+
+
+## Common environment patterns
+
+### Single environment (simple projects)
+
+For small projects or solo development, use a single environment:
+
+```bash
+flash deploy --env production
+```
+
+All deployments go to `production`. Simple, but no testing isolation.
+
+### Multiple environments (team projects)
+
+For team projects, use separate environments for development, testing, and production:
+
+```bash
+flash deploy --env dev # Development and testing
+flash deploy --env staging # QA and pre-production validation
+flash deploy --env production # Live user-facing deployment
+```
+
+Each environment is completely isolated. Deploy to `dev` for testing, `staging` for QA approval, then `production` for users.
+
+## Managing apps and environments
+
+Use the CLI to manage your apps and environments:
+
+```bash
+# Apps
+flash app list                # List all apps
+flash app get <app-name>      # View app details
+flash app delete <app-name>   # Delete app and all environments
+
+# Environments
+flash env list                # List environments in current app
+flash env get <env-name>      # View environment details
+flash env delete <env-name>   # Delete specific environment
+```
+
+
+Deleting an app or environment is irreversible. All endpoints and configuration are permanently removed.
+
\ No newline at end of file
diff --git a/flash/apps/build-app.mdx b/flash/apps/build-app.mdx
new file mode 100644
index 00000000..b9ad4c14
--- /dev/null
+++ b/flash/apps/build-app.mdx
@@ -0,0 +1,247 @@
+---
+title: "Build a Flash app"
+sidebarTitle: "Build a Flash app"
+description: "Create a Flash app, test it locally, and deploy it to production."
+---
+
+Flash apps let you build APIs to serve AI/ML workloads on Runpod Serverless. This guide walks you through the process of building a Flash app from scratch, from project initialization and local testing to production deployment.
+
+
+If you haven't already, we recommend starting with the [Quickstart](/flash/quickstart) guide to get a feel for how Flash `@Endpoint` functions work.
+
+
+## Requirements
+
+- You've [created a Runpod account](/get-started/manage-accounts).
+- You've [created a Runpod API key](/get-started/api-keys).
+- You've installed [Python 3.10 (or higher)](https://www.python.org/downloads/).
+
+## Step 1: Initialize a new project
+
+Create a new directory and install Flash using [uv](https://docs.astral.sh/uv/):
+
+```bash
+# Create the project directory and navigate into it:
+mkdir flash_app
+cd flash_app
+
+# Install Flash:
+uv venv
+source .venv/bin/activate
+uv pip install runpod-flash
+```
+
+Use the `flash init` command to generate a structured project template with a preconfigured application entry point:
+
+```bash
+flash init
+```
+
+Make sure your API key is set in the environment, either by creating a `.env` file or exporting the `RUNPOD_API_KEY` environment variable:
+
+```bash
+# Set the API key as an environment variable:
+export RUNPOD_API_KEY=YOUR_API_KEY
+
+# Or create a `.env` file:
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+Replace `YOUR_API_KEY` with your actual Runpod API key.
+
+## Step 2: Explore the project template
+
+The project template created by `flash init` includes:
+
+- Example worker files with `@Endpoint` decorated functions for load-balanced and queue-based endpoints.
+- Templates for `requirements.txt`, `.env.example`, `.gitignore`, etc.
+- Pre-configured endpoint configurations for GPU and CPU workers.
+
+When you start the server, it creates API endpoints at `/gpu/hello` and `/cpu/hello`, which call the endpoint functions defined in their respective worker files.
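+
+To give a concrete picture, a queue-based worker function in the template generally looks like the following sketch. The exact contents of the generated files may differ; the endpoint name, GPU type, and worker count below are illustrative:
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(
+    name="gpu_worker",                    # Illustrative name; check the generated file.
+    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,  # Illustrative GPU choice.
+    workers=1,
+    dependencies=["torch"],
+)
+async def gpu_hello(message: str) -> dict:
+    import torch
+
+    # Report the message along with the device the worker is running on.
+    device = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"
+    return {"message": message, "device": device}
+```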
+
+## Step 3: Install Python dependencies
+
+Install required dependencies:
+
+```bash
+uv pip install -r requirements.txt
+```
+
+## Step 4: Configure your API key
+
+Open the `.env` template file in a text editor and add your [Runpod API key](/get-started/api-keys):
+
+```bash
+# Use your text editor of choice, e.g.
+cursor .env
+```
+
+Remove the `#` symbol from the beginning of the `RUNPOD_API_KEY` line and replace `your_api_key_here` with your actual Runpod API key:
+
+```text
+RUNPOD_API_KEY=your_api_key_here
+# FLASH_HOST=localhost
+# FLASH_PORT=8888
+# LOG_LEVEL=INFO
+```
+
+Save the file and close it.
+
+## Step 5: Start the local API server
+
+Use `flash run` to start the API server:
+
+```bash
+flash run
+```
+
+Open a new terminal tab or window and test your endpoints using cURL:
+
+```bash
+# Test the queue-based GPU endpoint
+curl -X POST http://localhost:8888/gpu_worker/runsync \
+ -H "Content-Type: application/json" \
+ -d '{"message": "Hello from the GPU!"}'
+
+# Test the load-balanced endpoint
+curl -X POST http://localhost:8888/lb_worker/process \
+ -H "Content-Type: application/json" \
+ -d '{"data": "test"}'
+```
+
+If you switch back to the terminal tab where you used `flash run`, you'll see the details of the job's progress.
+
+### Faster testing with auto-provisioning
+
+For development with multiple endpoints, use `--auto-provision` to deploy all resources before testing:
+
+```bash
+flash run --auto-provision
+```
+
+This eliminates cold-start delays by provisioning all serverless endpoints upfront. Endpoints are cached and reused across server restarts, making subsequent runs faster. Resources are identified by name, so the same endpoint won't be re-deployed if the configuration hasn't changed.
+
+## Step 6: Open the API explorer
+
+Besides starting the API server, `flash run` also starts an interactive API explorer. Point your web browser at [http://localhost:8888/docs](http://localhost:8888/docs) to explore the API.
+
+To run endpoint functions in the explorer:
+
+1. Expand one of the functions under **GPU Workers** or **CPU Workers**.
+2. Click **Try it out** and then **Execute**.
+
+You'll get a response from your workers right in the explorer.
+
+## Step 7: Customize your endpoints
+
+To customize your endpoints:
+
+1. Edit the `@Endpoint` functions in your worker files (`lb_worker.py`, `gpu_worker.py`, `cpu_worker.py`).
+2. Add new worker files for new endpoints.
+3. Test individual workers by running them as scripts (e.g., `python gpu_worker.py`).
+4. Restart the development server to pick up changes.
+
+### Example: Adding a custom GPU endpoint
+
+To add a new GPU endpoint for image generation, create a new worker file or modify an existing one. For deployed apps, each queue-based function needs its own unique endpoint configuration:
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(
+ name="image_generator",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ workers=2,
+ dependencies=["diffusers", "torch", "transformers", "pillow"]
+)
+async def generate_image(prompt: str, width: int = 512, height: int = 512) -> dict:
+ import torch
+ from diffusers import StableDiffusionPipeline
+ import base64
+ import io
+
+ pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16
+ ).to("cuda")
+
+ image = pipeline(prompt=prompt, width=width, height=height).images[0]
+
+ buffered = io.BytesIO()
+ image.save(buffered, format="PNG")
+ img_str = base64.b64encode(buffered.getvalue()).decode()
+
+ return {"image": img_str, "prompt": prompt}
+```
+
+This creates a new Serverless endpoint specifically for image generation. When deployed, it will be available at its own endpoint URL with its own `/run` or `/runsync` routes.
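+
+Once deployed (see the next step), you can call the new endpoint from Python as well as cURL. This is a minimal sketch that assumes the queue-based request format shown elsewhere in this guide, with a placeholder endpoint ID and the function's arguments passed in the `input` object:
+
+```python
+import os
+
+import requests
+
+# Placeholder endpoint ID; use the URL printed by `flash deploy`.
+url = "https://api.runpod.ai/v2/your_endpoint_id/runsync"
+
+response = requests.post(
+    url,
+    headers={
+        "Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}",
+        "Content-Type": "application/json",
+    },
+    json={"input": {"prompt": "a watercolor fox", "width": 512, "height": 512}},
+)
+print(response.json())
+```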
+
+## Step 8: Deploy to Runpod
+
+When you're ready to deploy your app to Runpod, use `flash deploy`:
+
+```bash
+flash deploy
+```
+
+This command:
+
+1. Builds your application into a deployment artifact.
+2. Uploads it to Runpod's storage.
+3. Provisions independent Serverless endpoints for each endpoint configuration.
+4. Configures service discovery for inter-endpoint communication.
+
+After deployment, you'll receive URLs for all deployed endpoints, grouped by configuration type:
+
+```text
+✓ Deployment Complete
+
+Load-balanced endpoints:
+ https://abc123xyz.api.runpod.ai (lb_worker)
+ POST /process
+ GET /health
+
+Queue-based endpoints:
+ https://api.runpod.ai/v2/def456xyz (gpu_worker)
+ https://api.runpod.ai/v2/ghi789xyz (cpu_worker)
+```
+
+All requests to deployed endpoints require authentication with your Runpod API key. For example:
+
+```bash
+# Call a load-balanced endpoint
+curl -X POST https://abc123xyz.api.runpod.ai/process \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"input": {}}'
+
+# Call a queue-based endpoint
+curl -X POST https://api.runpod.ai/v2/def456xyz/runsync \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"input": {}}'
+```
+
+For detailed deployment options including environment management, see [Deploy Flash apps](/flash/apps/deploy-apps).
+
+## Next steps
+
+- [Deploy Flash applications](/flash/apps/deploy-apps) for production use.
+- [Configure hardware resources](/flash/configuration/parameters) for your endpoints.
+- [Monitor and troubleshoot](/flash/troubleshooting) your endpoints.
diff --git a/flash/apps/customize-app.mdx b/flash/apps/customize-app.mdx
new file mode 100644
index 00000000..04058679
--- /dev/null
+++ b/flash/apps/customize-app.mdx
@@ -0,0 +1,342 @@
+---
+title: "Customize your Flash app"
+sidebarTitle: "Customize your app"
+description: "Modify the Flash project template to build your application."
+---
+
+import { LoadBalancingEndpointsTooltip, QueueBasedEndpointsTooltip } from "/snippets/tooltips.jsx";
+
+After running `flash init`, you have a working project template with example load-balancing and queue-based endpoints. This guide shows you how to customize the template to build your application.
+
+## Understanding endpoint architecture
+
+The relationship between endpoint configurations and deployed Serverless endpoints differs between load-balanced and queue-based endpoints. Understanding this mapping is critical for building Flash apps correctly.
+
+### Key rules
+
+**Queue-based endpoints** follow a strict 1:1:1 rule:
+- 1 endpoint configuration : 1 `@Endpoint` function : 1 Serverless endpoint.
+- Each function must have its own unique endpoint name.
+- Each endpoint gets its own URL (e.g., `https://api.runpod.ai/v2/abc123xyz`).
+- Called via `/run` or `/runsync` routes.
+
+**Load-balanced endpoints** allow multiple routes on one endpoint:
+- 1 endpoint instance = multiple route decorators = 1 Serverless endpoint.
+- Multiple routes can share the same endpoint configuration.
+- All routes share one URL with different paths (e.g., `/generate`, `/health`).
+- Each route is defined by a `.get()`, `.post()`, or other method decorator.
+
+
+**Do not reuse the same endpoint name for multiple queue-based functions when deploying Flash apps.** Each queue-based function must have its own unique `name` parameter.
+
+
+### Examples
+
+The following sections demonstrate progressively more complex scenarios:
+
+
+
+**Your code:**
+
+```python title="gpu_worker.py"
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(
+ name="gpu-inference",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ dependencies=["torch"]
+)
+async def process_data(input: dict) -> dict:
+ import torch
+ # Your processing logic
+ return {"result": "processed"}
+```
+
+**What gets deployed:**
+
+- **1 Serverless endpoint**: `https://api.runpod.ai/v2/abc123xyz`
+ - Named: `gpu-inference`
+ - Hardware: A100 80GB GPUs.
+ - When you call the endpoint: A worker runs the `process_data` function using the input data you provide.
+
+**How to call it:**
+
+```bash
+# Synchronous call:
+curl -X POST https://api.runpod.ai/v2/abc123xyz/runsync \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"input": {"your": "data"}}'
+
+# Asynchronous call:
+curl -X POST https://api.runpod.ai/v2/abc123xyz/run \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"input": {"your": "data"}}'
+```
+
+**Key takeaway:** Each queue-based function must have its own unique endpoint name. Do not reuse the same name for multiple queue-based functions in Flash apps.
+
+
+
+
+**Your code:**
+
+```python title="gpu_worker.py"
+from runpod_flash import Endpoint, GpuType
+
+# Each function needs its own endpoint
+@Endpoint(
+ name="preprocess",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ dependencies=["torch"]
+)
+async def preprocess(data: dict) -> dict:
+ return {"preprocessed": data}
+
+@Endpoint(
+ name="inference",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ dependencies=["transformers"]
+)
+async def run_model(input: dict) -> dict:
+ return {"output": "result"}
+```
+
+**What gets deployed:**
+
+- **2 Serverless endpoints**:
+ 1. `https://api.runpod.ai/v2/abc123xyz` (Named: `preprocess` in the console)
+ 2. `https://api.runpod.ai/v2/def456xyz` (Named: `inference` in the console)
+
+**How to call them:**
+
+```bash
+# Call preprocess endpoint:
+curl -X POST https://api.runpod.ai/v2/abc123xyz/runsync \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"input": {"your": "data"}}'
+
+# Call inference endpoint:
+curl -X POST https://api.runpod.ai/v2/def456xyz/runsync \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"input": {"your": "data"}}'
+```
+
+**Key takeaway:** Each queue-based function must have its own unique endpoint name. Do not reuse the same name for multiple queue-based functions in Flash apps.
+
+
+
+
+
+**Your code:**
+
+```python title="lb_worker.py"
+from runpod_flash import Endpoint
+
+api = Endpoint(name="api-server", cpu="cpu5c-4-8", workers=(1, 5))
+
+@api.post("/generate")
+async def generate_text(prompt: str) -> dict:
+ return {"text": "generated"}
+
+@api.post("/translate")
+async def translate_text(text: str, target: str) -> dict:
+ return {"translated": text}
+
+@api.get("/health")
+async def health_check() -> dict:
+ return {"status": "healthy"}
+```
+
+**What gets deployed:**
+
+- **1 Serverless endpoint**: `https://abc123xyz.api.runpod.ai` (Named: `api-server`)
+- **3 HTTP routes**: `POST /generate`, `POST /translate`, `GET /health` (Defined by the route decorators in `lb_worker.py`)
+
+**How to call them:**
+
+```bash
+# Call /generate route:
+curl -X POST https://abc123xyz.api.runpod.ai/generate \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"prompt": "hello"}'
+
+# Call /health route (same endpoint URL):
+curl -X GET https://abc123xyz.api.runpod.ai/health \
+ -H "Authorization: Bearer $RUNPOD_API_KEY"
+```
+
+**Key takeaway:** Load-balanced endpoints can have multiple routes on a single Serverless endpoint. Each route's HTTP method and path are set by its decorator.
+
+
+
+
+
+**Your code:**
+
+```python title="mixed_api_worker.py"
+from runpod_flash import Endpoint, GpuType
+
+# Public-facing API (load-balanced)
+api = Endpoint(name="public-api", cpu="cpu5c-4-8", workers=(1, 5))
+
+@api.post("/process")
+async def handle_request(data: dict) -> dict:
+ # Call internal GPU worker
+ result = await run_gpu_inference(data)
+ return {"result": result}
+
+# Internal GPU worker (queue-based)
+@Endpoint(
+ name="gpu-backend",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ dependencies=["torch"]
+)
+async def run_gpu_inference(input: dict) -> dict:
+ import torch
+ # Heavy GPU computation
+ return {"inference": "result"}
+```
+
+**What gets deployed:**
+
+- **2 Serverless endpoints**:
+ 1. `https://abc123xyz.api.runpod.ai` (public-api, load-balanced)
+ 2. `https://api.runpod.ai/v2/def456xyz` (gpu-backend, queue-based)
+
+**Key takeaway:** You can mix endpoint types. Load-balanced endpoints can call queue-based endpoints internally.
+
+
+
+### Quick reference
+
+| Endpoint type | Configuration rule | Result |
+|---------------|-------------|--------|
+| Queue-based | 1 name : 1 function | 1 Serverless endpoint |
+| Load-balanced | 1 endpoint : 1 or more routes | 1 Serverless endpoint with one or more paths |
+| Mixed | Different names : different functions | Separate Serverless endpoints |
+
+## Add load balancing routes
+
+To add routes to an existing load balancing endpoint, use the route decorator pattern:
+
+```python title="lb_worker.py"
+from runpod_flash import Endpoint
+
+api = Endpoint(name="lb_worker", cpu="cpu5c-4-8", workers=(1, 5))
+
+# Existing routes
+@api.post("/process")
+async def process(input_data: dict) -> dict:
+ # ... existing code ...
+ pass
+
+# Add a new route
+@api.get("/status")
+async def get_status() -> dict:
+ return {"status": "healthy", "version": "1.0"}
+```
+
+All routes share the same `lb_worker` Serverless endpoint. Each route is accessible at its defined path.
+
+**Key points:**
+- Multiple routes can share one endpoint configuration.
+- Each route has its own HTTP method and path.
+- All routes on the same endpoint deploy to one Serverless endpoint.
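+
+To verify the new route quickly, call it through the local development server. This sketch assumes `flash run` is serving on the default port and exposes load-balanced routes under the worker name, as described in [Test Flash apps locally](/flash/apps/local-testing):
+
+```python
+import requests
+
+# Call the new /status route through the local development server.
+response = requests.get("http://localhost:8888/lb_worker/status")
+print(response.json())  # {"status": "healthy", "version": "1.0"}
+```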
+
+## Add queue-based endpoints
+
+To add a new queue-based endpoint, create a new endpoint with a unique name:
+
+```python title="gpu_worker.py"
+from runpod_flash import Endpoint, GpuType
+
+# Existing endpoint
+@Endpoint(
+ name="gpu-inference",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ workers=3,
+ dependencies=["torch"]
+)
+async def run_inference(input: dict) -> dict:
+ import torch
+ # Inference logic
+ return {"result": "processed"}
+
+# New endpoint for a different workload
+@Endpoint(
+ name="gpu-training",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ workers=1,
+ dependencies=["torch", "transformers"]
+)
+async def train_model(config: dict) -> dict:
+ import torch
+ from transformers import Trainer
+ # Training logic
+ return {"model_path": "/models/trained"}
+```
+
+This creates two separate Serverless endpoints, each with its own URL and scaling configuration.
+
+
+**Each queue-based function must have its own unique endpoint name.** Do not assign multiple `@Endpoint` functions to the same `name` when building Flash apps.
+
+
+## Modify endpoint configurations
+
+Customize endpoint configurations for each worker function in your app. Each `@Endpoint` function can have its own GPU type, scaling parameters, and timeouts optimized for its specific workload.
+
+```python
+# Example: Different configs for different workloads
+@Endpoint(
+ name="preprocess",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, # Cost-effective for preprocessing
+ workers=(0, 5)
+)
+async def preprocess(data): ...
+
+@Endpoint(
+ name="inference",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe, # High VRAM for large models
+ workers=(1, 10) # Keep one worker ready
+)
+async def inference(data): ...
+```
+
+See [Configuration parameters](/flash/configuration/parameters) for all available options, [GPU types](/flash/configuration/gpu-types) for selecting hardware, and [Best practices](/flash/configuration/best-practices) for optimization guidance.
+
+## Test your customizations
+
+After customizing your app, test locally with `flash run`:
+
+```bash
+flash run
+```
+
+This starts a development server at `http://localhost:8888` with:
+- Interactive API documentation at `/docs`.
+- Auto-reload on code changes.
+- Real remote execution on Runpod workers.
+
+Make sure to test that:
+- All HTTP routes work as expected.
+- Endpoint functions execute correctly.
+- Dependencies install properly.
+- Error handling works.
+
+## Next steps
+
+- [Test locally](/flash/apps/local-testing): Use `flash run` for local development and testing.
+- [Deploy to Runpod](/flash/apps/deploy-apps): Deploy your application to production with `flash deploy`.
+- [Configuration parameters](/flash/configuration/parameters): Complete reference for configuration options.
+- [Create endpoints](/flash/create-endpoints): Learn more about writing and optimizing endpoint functions.
diff --git a/flash/apps/deploy-apps.mdx b/flash/apps/deploy-apps.mdx
new file mode 100644
index 00000000..11ea5869
--- /dev/null
+++ b/flash/apps/deploy-apps.mdx
@@ -0,0 +1,458 @@
+---
+title: "Deploy Flash apps to Runpod"
+sidebarTitle: "Deploy to Runpod"
+description: "Build and deploy your FastAPI app to Runpod."
+---
+
+import { LoadBalancingEndpointsTooltip, QueueBasedEndpointsTooltip } from "/snippets/tooltips.jsx";
+
+Flash provides a complete deployment workflow for taking your local development project to production. Use `flash deploy` to build and deploy your application in a single command, or use `flash build` for more control over the build process.
+
+## Deployment workflow
+
+A typical deployment workflow looks like this:
+
+1. **Create a new project**: Use [`flash init`](/flash/cli/init) to create a new project.
+2. **Develop locally**: Use [`flash run`](/flash/cli/run) to test your application. Any functions decorated with `@Endpoint` will be run on Runpod Serverless workers.
+3. **Preview** (optional): Use [`flash deploy --preview`](/flash/cli/deploy) to test locally with Docker.
+4. **Deploy**: Use [`flash deploy`](/flash/cli/deploy) to push to Runpod Serverless.
+5. **Manage**: Use [`flash env`](/flash/cli/env) and [`flash app`](/flash/cli/app) to manage your deployments.
+
+## Deploy your application
+
+When you're satisfied with your endpoint functions and ready to move to production, use `flash deploy` to build and deploy your Flash application:
+
+```bash
+flash deploy
+```
+
+This command performs the following steps:
+
+1. **Build**: Packages your code, dependencies, and manifest.
+2. **Upload**: Sends the artifact to Runpod's storage.
+3. **Provision**: Creates or updates Serverless endpoints.
+4. **Configure**: Sets up environment variables and service discovery.
+
+### Deployment architecture
+
+Flash deploys your application as multiple independent Serverless endpoints. Each endpoint configuration in your worker files becomes a separate endpoint:
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ Users(["USERS"])
+ StateManager["Runpod GraphQL API • Service discovery • Manifest registry"]
+
+ subgraph Runpod ["RUNPOD SERVERLESS"]
+ LB["lb_worker ENDPOINT (load-balanced) • POST /process • GET /health"]
+ GPU["gpu_worker ENDPOINT (queue-based) • POST /runsync"]
+ CPU["cpu_worker ENDPOINT (queue-based) • POST /runsync"]
+
+ LB <-.->|"inter-endpoint calls"| GPU
+ LB <-.->|"inter-endpoint calls"| CPU
+
+ LB -.->|"service discovery"| StateManager
+ GPU -.->|"service discovery"| StateManager
+ CPU -.->|"service discovery"| StateManager
+ end
+
+ Users -->|"call directly"| LB
+ Users -->|"call directly"| GPU
+ Users -->|"call directly"| CPU
+
+ style Runpod fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Users fill:#4D38F5,stroke:#4D38F5,color:#fff
+ style LB fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style GPU fill:#22C55E,stroke:#22C55E,color:#000
+ style CPU fill:#22C55E,stroke:#22C55E,color:#000
+ style StateManager fill:#AE6DFF,stroke:#AE6DFF,color:#fff
+```
+
+**How Flash deployments work:**
+
+- **One endpoint name = one endpoint**: Each unique endpoint configuration (defined by its `name` parameter) creates a separate Serverless endpoint with its own URL.
+- **Call any endpoint**: After deployment, you can call whichever endpoint you need—`lb_worker` for API requests, `gpu_worker` for GPU tasks, `cpu_worker` for CPU tasks.
+- [Load balancing endpoints](/flash/create-endpoints#load-balanced-endpoints): Create HTTP APIs with custom routes using `.get()`, `.post()`, etc. decorators.
+- [Queue-based endpoints](/flash/create-endpoints#queue-based-endpoints): Run compute tasks using the `/runsync` or `/run` routes.
+- **Inter-endpoint communication**: Endpoints can call each other's functions when needed, using the Runpod GraphQL service for discovery.
+
+### Deploy to an environment
+
+Flash organizes deployments using [apps and environments](/flash/apps/apps-and-environments). Deploy to a specific environment using the `--env` flag:
+
+```bash
+# Deploy to staging
+flash deploy --env staging
+
+# Deploy to production
+flash deploy --env production
+```
+
+If the specified environment doesn't exist, Flash creates it automatically.
+
+### Post-deployment
+
+After a successful deployment, Flash displays all deployed endpoints grouped by type:
+
+```text
+✓ Deployment Complete
+
+Load-balanced endpoints:
+ https://abc123xyz.api.runpod.ai (lb_worker)
+ POST /process
+ GET /health
+
+ Try it:
+ curl -X POST https://abc123xyz.api.runpod.ai/process \
+ -H "Content-Type: application/json" \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"input": {}}'
+
+Queue-based endpoints:
+ https://api.runpod.ai/v2/def456xyz (gpu_worker)
+ https://api.runpod.ai/v2/ghi789xyz (cpu_worker)
+
+ Try it:
+ curl -X POST https://api.runpod.ai/v2/def456xyz/runsync \
+ -H "Content-Type: application/json" \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"input": {}}'
+```
+
+Each endpoint is independent with its own URL and authentication.
+
+## Understanding endpoint architecture
+
+The relationship between endpoint configurations and deployed endpoints differs between load-balanced and queue-based endpoints:
+
+### Queue-based endpoints (one function per endpoint)
+
+For queue-based endpoints, each `@Endpoint` function must have its own unique name:
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+# Each function needs its own endpoint name
+@Endpoint(
+ name="run-model",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ dependencies=["torch"]
+)
+def run_model(input: dict): ...
+
+@Endpoint(
+ name="preprocess",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ dependencies=["transformers"]
+)
+def preprocess(data: dict): ...
+```
+
+This creates two separate Serverless endpoints:
+- `https://api.runpod.ai/v2/abc123xyz` (run-model)
+- `https://api.runpod.ai/v2/def456xyz` (preprocess)
+
+**Calling queue-based endpoints:**
+
+```bash
+# Call run_model endpoint (synchronous):
+curl -X POST https://api.runpod.ai/v2/abc123xyz/runsync \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"input": {"your": "data"}}'
+
+# Or call asynchronously with /run:
+curl -X POST https://api.runpod.ai/v2/abc123xyz/run \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"input": {"your": "data"}}'
+```
+
+
+**Important:** For deployed queue-based endpoints, you must use **one function per endpoint name**. Each function creates its own Serverless endpoint. Do not give multiple `@Endpoint` functions the same name when building Flash apps.
+
+
+### Load-balanced endpoints (multiple routes per endpoint)
+
+For load-balanced endpoints, you can define multiple HTTP routes on a single endpoint:
+
+```python
+from runpod_flash import Endpoint
+
+api = Endpoint(name="api", cpu="cpu5c-4-8", workers=(1, 5))
+
+# Multiple routes on a single Serverless endpoint:
+@api.post("/generate")
+def generate_text(prompt: str): ...
+
+@api.post("/translate")
+def translate_text(text: str): ...
+
+@api.get("/health")
+def health_check(): ...
+```
+
+This creates:
+- **One Serverless endpoint**: `https://abc123xyz.api.runpod.ai` (named "api")
+- **Three HTTP routes**: `POST /generate`, `POST /translate`, `GET /health`
+
+**Calling load-balanced endpoints:**
+
+```bash
+# Call the /generate route:
+curl -X POST https://abc123xyz.api.runpod.ai/generate \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"prompt": "hello"}'
+
+# Call the /health route (same endpoint URL):
+curl -X GET https://abc123xyz.api.runpod.ai/health \
+ -H "Authorization: Bearer $RUNPOD_API_KEY"
+```
+
+### Key takeaway
+
+- **Queue-based**: 1 endpoint name = 1 function = 1 Serverless endpoint
+- **Load-balanced**: 1 endpoint instance = multiple routes = 1 Serverless endpoint
+
+## Preview before deploying
+
+Test your deployment locally using Docker before pushing to production using the `--preview` flag:
+
+```bash
+flash deploy --preview
+```
+
+This command:
+
+1. Builds your project (creates the deployment artifact and manifest).
+2. Creates a Docker network for inter-container communication.
+3. Starts one container per endpoint configuration (`lb_worker`, `gpu_worker`, `cpu_worker`, etc.).
+4. Exposes all endpoints for local testing.
+
+Use preview mode to:
+
+- Validate your deployment configuration.
+- Test cross-endpoint function calls.
+- Debug resource provisioning issues.
+- Verify the manifest structure.
+
+Press `Ctrl+C` to stop the preview environment.
+
+## Managing deployment size
+
+Runpod Serverless has a **500 MB deployment limit**. Flash automatically excludes packages that are pre-installed in the base image:
+
+- `torch`, `torchvision`, `torchaudio`
+- `numpy`, `triton`
+
+If your deployment still exceeds the limit, use the `--exclude` flag to skip additional packages:
+
+```bash
+flash deploy --exclude scipy,pandas
+```
+
+### Base image packages
+
+| Configuration type | Base image | Auto-excluded packages |
+|--------------|------------|------------------------|
+| GPU (`gpu=`) | PyTorch base | `torch`, `torchvision`, `torchaudio`, `numpy`, `triton` |
+| CPU (`cpu=`) | Python slim | `torch`, `torchvision`, `torchaudio`, `numpy`, `triton` |
+| Load-balanced | Same as GPU/CPU | Same as GPU/CPU |
+
+
+
+Check the [worker-flash repository](https://github.com/runpod-workers/worker-flash) for current base images and pre-installed packages.
+
+
+
+## Build process
+
+When you run `flash deploy` (or `flash build`), Flash:
+
+1. **Discovers** all `@Endpoint` decorated functions.
+2. **Groups** functions by their endpoint name.
+3. **Generates** handler files for each endpoint.
+4. **Creates** a `flash_manifest.json` file for service discovery.
+5. **Installs** dependencies with Linux x86_64 compatibility.
+6. **Packages** everything into `.flash/artifact.tar.gz`.
+
+### Cross-platform builds
+
+Flash automatically handles cross-platform builds. You can build on macOS, Windows, or Linux, and the resulting package will run correctly on Runpod's Linux x86_64 infrastructure.
+
+### Build artifacts
+
+After building, these artifacts are created in the `.flash/` directory:
+
+| Artifact | Description |
+|----------|-------------|
+| `.flash/artifact.tar.gz` | Deployment package |
+| `.flash/flash_manifest.json` | Service discovery configuration |
+| `.flash/.build/` | Temporary build directory (removed by default) |
+
+## What gets deployed to Runpod
+
+When you deploy a Flash app, you're deploying a **build artifact** (tarball) onto pre-built Flash Docker images. This architecture is similar to AWS Lambda layers: the base runtime is pre-built, and your code and dependencies are layered on top.
+
+### The build artifact
+
+The `.flash/artifact.tar.gz` file (max 500 MB) contains:
+
+- Your worker Python files (for example, `lb_worker.py`, `gpu_worker.py`, `cpu_worker.py`).
+- Dependencies installed and bundled during the build.
+- The deployment manifest (`flash_manifest.json`).
+- Auto-generated handler code.
+
+Dependencies are installed locally during the build process and bundled into the tarball. They are **not** installed at runtime on endpoints.
+
+### The deployment manifest
+
+The `flash_manifest.json` file is the brain of your deployment. It tells each endpoint:
+
+- Which functions to execute.
+- What Docker image to use.
+- How to configure resources (GPUs, workers, scaling).
+- How to route HTTP requests (for load balancer endpoints).
+
+```json
+{
+ "resources": {
+ "lb_worker": {
+ "is_load_balanced": true,
+ "imageName": "runpod/flash-lb-cpu:latest",
+ "workersMin": 1,
+ "functions": [
+ {"name": "process", "module": "lb_worker"},
+ {"name": "health", "module": "lb_worker"}
+ ]
+ },
+ "gpu_worker": {
+ "imageName": "runpod/flash:latest",
+ "gpuIds": "AMPERE_16",
+ "workersMax": 3,
+ "functions": [
+ {"name": "gpu_hello", "module": "gpu_worker"}
+ ]
+ },
+ "cpu_worker": {
+ "imageName": "runpod/flash-cpu:latest",
+ "workersMax": 2,
+ "functions": [
+ {"name": "cpu_hello", "module": "cpu_worker"}
+ ]
+ }
+ },
+ "routes": {
+ "lb_worker": {
+ "POST /process": "process",
+ "GET /health": "health"
+ }
+ }
+}
+```
+
+### What gets created on Runpod
+
+For each endpoint configuration in the manifest, Flash creates an independent Serverless endpoint. Each endpoint runs as its own service with its own URL.
+
+**Load-balanced endpoints** ([load balancer](/serverless/load-balancing/overview))
+
+- **Purpose**: HTTP-facing services for custom API routes
+- **Image**: Pre-built `runpod/flash-lb-cpu:latest` or `runpod/flash-lb:latest`
+- **Use cases**: REST APIs, webhooks, public-facing services
+- **Example**: `lb_worker.py` with `@api.post("/process")`
+- **Routes**: Custom HTTP endpoints defined in your route decorators
+- **Startup process**:
+ 1. Container extracts your tarball
+ 2. Auto-generated handler imports your worker file (e.g., `lb_worker.py`)
+ 3. Routes are registered from decorators
+ 4. Uvicorn server starts on port 8000
+- **Service discovery**: Queries the state manager for cross-endpoint calls
+
+**Queue-based endpoints** (serverless compute)
+
+- **Purpose**: Background compute for intensive `@Endpoint` functions
+- **Image**: Pre-built `runpod/flash:latest` (GPU) or `runpod/flash-cpu:latest` (CPU)
+- **Use cases**: GPU inference, batch processing, heavy computation
+- **Example**: `gpu_worker.py` with `@Endpoint(name="...", gpu=...)`
+- **Routes**: Automatic `/runsync` endpoint for job submission
+- **Startup process**:
+ 1. Container extracts your tarball
+ 2. Worker module is imported (e.g., `gpu_worker.py`)
+ 3. Function registry maps function names to callables
+ 4. Worker listens for jobs from job queue
+- **Execution**: Sequential job processing with automatic retry logic
+- **Service discovery**: Queries the state manager for cross-endpoint calls
+
+### Cross-endpoint communication
+
+When one endpoint needs to call a function on another endpoint:
+
+1. **Manifest lookup**: The calling endpoint checks `flash_manifest.json` for the function-to-resource mapping.
+2. **Service discovery**: It queries the state manager (Runpod GraphQL API) for the target endpoint's URL.
+3. **Direct call**: It makes an HTTP request directly to the target endpoint.
+4. **Response**: The target endpoint executes the function and returns the result.
+
+Each endpoint maintains its own connection to the state manager, querying for peer endpoint URLs as needed and caching results for 300 seconds to minimize API calls.
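+
+For reference, this is the same mixed-endpoint pattern shown in [Customize your Flash app](/flash/apps/customize-app). When the load-balanced route awaits the queue-based function, Flash performs the manifest lookup and service discovery steps above behind the scenes:
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+api = Endpoint(name="public-api", cpu="cpu5c-4-8", workers=(1, 5))
+
+@Endpoint(
+    name="gpu-backend",
+    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+    dependencies=["torch"]
+)
+async def run_gpu_inference(input: dict) -> dict:
+    import torch
+
+    # Heavy GPU computation runs on the gpu-backend endpoint.
+    return {"inference": "result"}
+
+@api.post("/process")
+async def handle_request(data: dict) -> dict:
+    # Awaiting the queue-based function from this load-balanced route triggers
+    # the manifest lookup and service discovery steps described above.
+    result = await run_gpu_inference(data)
+    return {"result": result}
+```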
+
+## Troubleshooting
+
+### No @Endpoint functions found
+
+If the build process can't find your endpoint functions:
+
+- Ensure functions are decorated with `@Endpoint(...)`.
+- Check that Python files aren't excluded by `.gitignore` or `.flashignore`.
+- Verify decorator syntax is correct.
+
+### Deployment size limit exceeded
+
+Base image packages are auto-excluded. If your deployment still exceeds 500 MB, use `--exclude` to skip additional packages:
+
+```bash
+flash deploy --exclude scipy,pandas
+```
+
+### Authentication errors
+
+Verify your API key is set correctly:
+
+```bash
+echo $RUNPOD_API_KEY
+```
+
+If not set, add it to your `.env` file or export it:
+
+```bash
+export RUNPOD_API_KEY=your_api_key_here
+```
+
+### Import errors in endpoint functions
+
+Import packages inside the endpoint function, not at the top of the file:
+
+```python
+@Endpoint(name="fetch-data", gpu=GpuGroup.ANY, dependencies=["requests"])
+def fetch_data(url):
+ import requests # Import here
+ return requests.get(url).json()
+```
+
+## Next steps
+
+- [Learn about apps and environments](/flash/apps/apps-and-environments) for managing deployments.
+- [View the CLI reference](/flash/cli/overview) for all available commands.
+- [Configure hardware resources](/flash/configuration/parameters) for your endpoints.
+- [Monitor and troubleshoot](/flash/troubleshooting) your deployments.
diff --git a/flash/apps/initialize-project.mdx b/flash/apps/initialize-project.mdx
new file mode 100644
index 00000000..88c5d528
--- /dev/null
+++ b/flash/apps/initialize-project.mdx
@@ -0,0 +1,142 @@
+---
+title: "Initialize a Flash app project"
+sidebarTitle: "Initialize a project"
+description: "Use flash init to create a new Flash project with a ready-to-use structure."
+---
+
+import { LoadBalancingEndpointsTooltip, QueueBasedEndpointsTooltip } from "/snippets/tooltips.jsx";
+
+The `flash init` command creates a new Flash project with a complete project structure, including example load-balancing and queue-based endpoints and configuration files. This gives you a working starting point for building Flash applications.
+
+Use `flash init` whenever you want to start a new Flash project, fully configured for you to run `flash run` and `flash deploy`.
+
+## Create a new project
+
+Create a new project in a new directory:
+
+```bash
+flash init PROJECT_NAME
+cd PROJECT_NAME
+```
+
+Or initialize in your current directory:
+
+```bash
+flash init .
+```
+
+## Project structure
+
+`flash init` creates a project containing example worker files (`lb_worker.py`, `gpu_worker.py`, `cpu_worker.py`), a `requirements.txt`, a `.env.example` template, and a `.flashignore` file.
+
+### Key files
+
+**lb_worker.py**: An example load-balanced worker with HTTP routes. Contains `@Endpoint` functions with custom HTTP methods and paths (e.g., `POST /process`, `GET /health`). Creates one endpoint when deployed.
+
+**gpu_worker.py**: An example GPU queue-based worker. Contains `@Endpoint` functions that run on GPU hardware. Provides `/runsync` route for job submission. Creates one endpoint when deployed.
+
+**cpu_worker.py**: An example CPU queue-based worker. Contains `@Endpoint` functions that run on CPU-only instances. Provides `/runsync` route for job submission. Creates one endpoint when deployed.
+
+**.flashignore**: Lists files and directories to exclude from the deployment artifact (similar to `.gitignore`).
+
+Each worker file defines a resource configuration and its associated functions. When you deploy, Flash creates one Serverless endpoint per unique resource configuration.
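+
+For example, a load-balanced worker file generally follows this shape (a simplified sketch; the generated template includes more detail):
+
+```python title="lb_worker.py"
+from runpod_flash import Endpoint
+
+# One endpoint configuration shared by every route in this file.
+api = Endpoint(name="lb_worker", cpu="cpu5c-4-8", workers=(1, 5))
+
+@api.post("/process")
+async def process(input_data: dict) -> dict:
+    return {"processed": input_data}
+
+@api.get("/health")
+async def health() -> dict:
+    return {"status": "healthy"}
+```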
+
+## Set up the project
+
+After initialization, complete the setup:
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Copy environment template
+cp .env.example .env
+
+# Add your API key to .env
+# RUNPOD_API_KEY=your_api_key_here
+```
+
+## How it fits into the workflow
+
+`flash init` is the first step in the Flash development workflow:
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart LR
+ Init["flash init"]
+ Dev["flash run"]
+ Deploy["flash deploy"]
+
+ Init -->|"Create project"| Dev
+ Dev -->|"Test locally"| Deploy
+
+ style Init fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style Dev fill:#22C55E,stroke:#22C55E,color:#000
+ style Deploy fill:#4D38F5,stroke:#4D38F5,color:#fff
+```
+
+1. **`flash init`**: Creates project structure and boilerplate.
+2. **`flash run`**: Starts local development server for testing.
+3. **`flash deploy`**: Builds and deploys to Runpod Serverless.
+
+## Handle existing files
+
+If you run `flash init` in a directory with existing files, Flash detects conflicts and prompts for confirmation:
+
+```text
+┌─ File Conflicts Detected ─────────────────────┐
+│ Warning: The following files will be │
+│ overwritten: │
+│ │
+│ • lb_worker.py │
+│ • gpu_worker.py │
+│ • requirements.txt │
+└───────────────────────────────────────────────┘
+Continue and overwrite these files? [y/N]:
+```
+
+Use `--force` to skip the prompt and overwrite files:
+
+```bash
+flash init . --force
+```
+
+## Start developing
+
+Once your project is set up:
+
+```bash
+# Start the development server
+flash run
+
+# Open the API explorer
+# http://localhost:8888/docs
+```
+
+Make changes to your worker files, and the server reloads automatically. When you're ready, deploy with:
+
+```bash
+flash deploy
+```
+
+## Next steps
+
+- [Customize your app](/flash/apps/customize-app) to add endpoints and modify configurations.
+- [Test locally](/flash/apps/local-testing) with `flash run`.
+- [Deploy to production](/flash/apps/deploy-apps) with `flash deploy`.
+- [View the flash init reference](/flash/cli/init) for all options.
diff --git a/flash/apps/local-testing.mdx b/flash/apps/local-testing.mdx
new file mode 100644
index 00000000..f4b52120
--- /dev/null
+++ b/flash/apps/local-testing.mdx
@@ -0,0 +1,194 @@
+---
+title: "Test Flash apps locally"
+sidebarTitle: "Test locally"
+description: "Use flash run to test your Flash application locally before deploying."
+---
+
+The `flash run` command starts a local development server that lets you test your Flash application before deploying to production. The development server runs locally and updates automatically as you edit files. When you call an `@Endpoint` function, Flash sends the latest function code to Serverless workers on Runpod, so your changes are reflected immediately.
+
+Use `flash run` when you want to:
+
+- Iterate quickly with automatic code updates.
+- Test `@Endpoint` functions against real GPU/CPU workers.
+- Debug request/response handling before deployment.
+- Develop without redeploying after every change.
+
+## Start the development server
+
+From inside your [project directory](/flash/apps/initialize-project), run:
+
+```bash
+flash run
+```
+
+The server starts at `http://localhost:8888` by default. Your endpoints are available immediately for testing, and `@Endpoint` functions provision Serverless endpoints on first call.
+
+### Custom host and port
+
+```bash
+# Change port
+flash run --port 3000
+
+# Make accessible on network
+flash run --host 0.0.0.0
+```
+
+## Test your endpoints
+
+### Using curl
+
+```bash
+# Call a queue-based endpoint (gpu_worker.py)
+curl -X POST http://localhost:8888/gpu_worker/runsync \
+ -H "Content-Type: application/json" \
+ -d '{"message": "Hello from Flash"}'
+
+# Call a load-balanced endpoint (lb_worker.py)
+curl -X POST http://localhost:8888/lb_worker/process \
+ -H "Content-Type: application/json" \
+ -d '{"data": "test"}'
+```
+
+### Using the API explorer
+
+Open [http://localhost:8888/docs](http://localhost:8888/docs) in your browser to access the interactive Swagger UI. You can test all endpoints directly from the browser.
+
+### Using Python
+
+```python
+import requests
+
+# Call queue-based endpoint
+response = requests.post(
+ "http://localhost:8888/gpu_worker/runsync",
+ json={"message": "Hello from Flash"}
+)
+print(response.json())
+
+# Call load-balanced endpoint
+response = requests.post(
+ "http://localhost:8888/lb_worker/process",
+ json={"data": "test"}
+)
+print(response.json())
+```
+
+## Reduce cold-start delays
+
+The first call to an `@Endpoint` function provisions a Serverless endpoint, which takes 30-60 seconds. Use `--auto-provision` to provision all endpoints at startup:
+
+```bash
+flash run --auto-provision
+```
+
+This scans your project for `@Endpoint` functions and deploys them before the server starts accepting requests. Endpoints are cached in `.runpod/resources.pkl` and reused across server restarts.
+
+## How it works
+
+With `flash run`, Flash starts a local development server alongside remote Serverless endpoints:
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ Browser(["BROWSER/CURL"])
+
+ subgraph Local ["YOUR MACHINE (localhost:8888)"]
+ DevServer["Development Server • Auto-reload on changes • API explorer at /docs • Routes requests"]
+ end
+
+ subgraph Runpod ["RUNPOD SERVERLESS"]
+ LB["live-lb_worker"]
+ GPU["live-gpu_worker"]
+ CPU["live-cpu_worker"]
+ end
+
+ Browser -->|"HTTP"| DevServer
+ DevServer -->|"HTTPS"| LB
+ DevServer -->|"HTTPS"| GPU
+ DevServer -->|"HTTPS"| CPU
+
+ style Local fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Runpod fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Browser fill:#4D38F5,stroke:#4D38F5,color:#fff
+ style DevServer fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style LB fill:#22C55E,stroke:#22C55E,color:#000
+ style GPU fill:#22C55E,stroke:#22C55E,color:#000
+ style CPU fill:#22C55E,stroke:#22C55E,color:#000
+```
+
+**What runs where:**
+
+| Component | Location |
+|-----------|----------|
+| Development server | Your machine (localhost:8888) |
+| `@Endpoint` function code | Runpod Serverless |
+| Endpoint storage | Runpod Serverless |
+
+Your code updates automatically as you edit files. Endpoints created by `flash run` are prefixed with `live-` to distinguish them from production endpoints.
+
+## Development workflow
+
+A typical development cycle looks like this:
+
+1. Start the server: `flash run`
+2. Make changes to your code.
+3. The server reloads automatically.
+4. Test your changes via curl or the API explorer.
+5. Repeat until ready to deploy.
+
+When you're done, use `flash undeploy` to clean up the `live-` endpoints created during development.
+
+## Differences from production
+
+| Aspect | `flash run` | `flash deploy` |
+|--------|-------------|----------------|
+| FastAPI app runs on | Your machine | Runpod Serverless |
+| Endpoint naming | `live-` prefix | No prefix |
+| Automatic updates | Yes | No |
+| Authentication | Not required | Required |
+
+## Clean up after testing
+
+Endpoints created by `flash run` persist until you delete them. To clean up:
+
+```bash
+# List all endpoints
+flash undeploy list
+
+# Remove a specific endpoint
+flash undeploy ENDPOINT_NAME
+
+# Remove all endpoints
+flash undeploy --all
+```
+
+## Troubleshooting
+
+**Port already in use**
+
+```bash
+flash run --port 3000
+```
+
+**Slow first request**
+
+Use `--auto-provision` to eliminate cold-start delays:
+
+```bash
+flash run --auto-provision
+```
+
+**Authentication errors**
+
+Ensure `RUNPOD_API_KEY` is set in your `.env` file or environment:
+
+```bash
+export RUNPOD_API_KEY=your_api_key_here
+```
+
+## Next steps
+
+- [Deploy to production](/flash/apps/deploy-apps) when your app is ready.
+- [Clean up endpoints](/flash/cli/undeploy) after testing.
+- [View the flash run reference](/flash/cli/run) for all options.
diff --git a/flash/apps/overview.mdx b/flash/apps/overview.mdx
new file mode 100644
index 00000000..d5738f05
--- /dev/null
+++ b/flash/apps/overview.mdx
@@ -0,0 +1,314 @@
+---
+title: "Overview"
+sidebarTitle: "Overview"
+description: "Understand the Flash app development lifecycle."
+---
+
+import { ServerlessTooltip } from "/snippets/tooltips.jsx";
+
+A Flash app is a collection of endpoints deployed to Runpod. When you deploy an app, Runpod:
+
+1. Packages your code, dependencies, and deployment manifest into a tarball (max 500 MB).
+2. Uploads the tarball to Runpod.
+3. Provisions independent Serverless endpoints based on your [endpoint configurations](/flash/create-endpoints).
+
+This page explains the key concepts and processes you'll use when building Flash apps.
+
+
+If you prefer to learn by doing, follow this tutorial to [build your first Flash app](/flash/apps/build-app).
+
+
+
+## App development overview
+
+Building a Flash application follows a clear progression from initialization to production deployment:
+
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ Init["flash init Create project"]
+ Code["Define endpoints with @Endpoint functions"]
+ Run["Test locally with flash run"]
+ Deploy["Deploy to Runpod with flash deploy"]
+ Manage["Manage apps and environments with flash app and flash env"]
+
+ Init --> Code
+ Code --> Run
+ Run -->|"Ready for production"| Deploy
+ Deploy --> Manage
+ Run -->|"Continue development"| Code
+
+ style Init fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style Code fill:#22C55E,stroke:#22C55E,color:#000
+ style Run fill:#4D38F5,stroke:#4D38F5,color:#fff
+ style Deploy fill:#AE6DFF,stroke:#AE6DFF,color:#000
+ style Manage fill:#9289FE,stroke:#9289FE,color:#fff
+```
+
+
+
+
+ Use `flash init` to create a new project with example workers:
+
+ ```bash
+ flash init PROJECT_NAME
+ cd PROJECT_NAME
+ pip install -r requirements.txt
+ ```
+
+ This gives you a working project structure with GPU and CPU worker examples. [Learn more about project initialization](/flash/apps/initialize-project).
+
+
+
+ Write your application code by defining `Endpoint` functions that execute on Runpod workers:
+
+ ```python
+ from runpod_flash import Endpoint, GpuType
+
+ @Endpoint(
+ name="inference-worker",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ workers=3,
+ dependencies=["torch"]
+ )
+ def run_inference(prompt: str) -> dict:
+ import torch
+ # Your inference logic here
+ return {"result": "..."}
+ ```
+
+ [Learn more about endpoint functions](/flash/create-endpoints).
+
+
+
+ Start a local development server to test your application:
+
+ ```bash
+ flash run
+ ```
+
+ Your app runs locally and updates automatically. When you call an `@Endpoint` function, Flash sends the latest code to Runpod workers. This hybrid architecture lets you iterate quickly without redeploying. [Learn more about local testing](/flash/apps/local-testing).
+
+
+
+ When ready for production, deploy your application to Runpod Serverless:
+
+ ```bash
+ flash deploy
+ ```
+
+ Your entire application—including all worker functions—runs on Runpod infrastructure. [Learn more about deployment](/flash/apps/deploy-apps).
+
+
+
+ Use apps and environments to organize and manage your deployments across different stages (dev, staging, production). [Learn more about apps and environments](/flash/apps/apps-and-environments).
+
+
+
+## Apps and environments
+
+Flash uses a two-level organizational structure: **apps** (project containers) and **environments** (deployment stages like dev, staging, production). See [Apps and environments](/flash/apps/apps-and-environments) for complete details.
+
+## Local vs production deployment
+
+Flash supports two modes of operation:
+
+### Local development (`flash run`)
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ subgraph Local ["YOUR MACHINE"]
+ DevServer["Development Server • Auto-reload on changes • localhost:8888"]
+ end
+
+ subgraph Runpod ["RUNPOD SERVERLESS"]
+ LB["live-lb_worker"]
+ GPU["live-gpu_worker"]
+ CPU["live-cpu_worker"]
+ end
+
+ DevServer -->|"HTTPS"| LB
+ DevServer -->|"HTTPS"| GPU
+ DevServer -->|"HTTPS"| CPU
+
+ style Local fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Runpod fill:#1a1a2e,stroke:#22C55E,stroke-width:2px,color:#fff
+ style DevServer fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style LB fill:#22C55E,stroke:#22C55E,color:#000
+ style GPU fill:#22C55E,stroke:#22C55E,color:#000
+ style CPU fill:#22C55E,stroke:#22C55E,color:#000
+```
+
+**How it works:**
+
+- Development server runs on your machine and updates automatically.
+- `@Endpoint` functions deploy to Runpod endpoints (one for each endpoint configuration).
+- Endpoints are prefixed with `live-` for easy identification.
+- No authentication required for local testing.
+- Fast iteration on application logic.
+
+### Production deployment (`flash deploy`)
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ Users(["USERS"])
+ StateManager["Runpod GraphQL API • Service discovery • Manifest registry"]
+
+ subgraph Runpod ["RUNPOD SERVERLESS"]
+ LB["lb_worker ENDPOINT (load-balanced) • POST /process • GET /health"]
+ GPU["gpu_worker ENDPOINT (queue-based) • POST /runsync"]
+ CPU["cpu_worker ENDPOINT (queue-based) • POST /runsync"]
+
+ LB <-.->|"inter-endpoint calls"| GPU
+ LB <-.->|"inter-endpoint calls"| CPU
+
+ LB -.->|"service discovery"| StateManager
+ GPU -.->|"service discovery"| StateManager
+ CPU -.->|"service discovery"| StateManager
+ end
+
+ Users -->|"HTTPS (auth required)"| LB
+ Users -->|"HTTPS (auth required)"| GPU
+ Users -->|"HTTPS (auth required)"| CPU
+
+ style Runpod fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Users fill:#4D38F5,stroke:#4D38F5,color:#fff
+ style LB fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style GPU fill:#22C55E,stroke:#22C55E,color:#000
+ style CPU fill:#22C55E,stroke:#22C55E,color:#000
+ style StateManager fill:#AE6DFF,stroke:#AE6DFF,color:#fff
+```
+
+**How it works:**
+
+- All endpoints run independently on Runpod Serverless (one for each endpoint configuration).
+- Each endpoint has its own public HTTPS URL.
+- API key authentication is required for all requests.
+- Automatic scaling based on load.
+- Production-grade reliability and performance.
+
+### Endpoint functions vs. Serverless endpoints
+
+Understanding the relationship between your code (endpoint functions) and deployed infrastructure (Serverless endpoints) is crucial for building Flash apps.
+
+**Serverless endpoints** are the underlying infrastructure Flash creates on Runpod. Each unique endpoint configuration (defined by its `name` parameter) creates one Serverless endpoint with specific hardware (GPU type, worker count, etc.). Each Serverless endpoint gets its own public HTTPS URL (e.g., `https://abc123xyz.api.runpod.ai` for load-balanced or `https://api.runpod.ai/v2/abc123xyz` for queue-based).
+
+You call these endpoints to execute your functions. The endpoint configuration type determines the behavior and HTTPS URL of the endpoint:
+
+- **For queue-based endpoints**: You can only have one function per endpoint, which will be executed when you call `/runsync` or `/run` on the endpoint.
+- **For load-balanced endpoints**: You can have multiple functions with different HTTP routes per endpoint, which will be executed when you call the endpoint with the appropriate HTTP method and path.
+
+#### Queue-based example
+
+Queue-based endpoints must have exactly one function defined per endpoint configuration, which will be executed when you call the `/runsync` or `/run` route on the endpoint.
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+# Each queue-based function needs its own endpoint configuration
+@Endpoint(name="preprocess", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
+def preprocess(data): ...
+
+@Endpoint(name="inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
+def run_model(input): ...
+```
+
+This creates two separate Serverless endpoints, each with its own public HTTPS URL and `/run` or `/runsync` route.
+
+The URL depends on your endpoint ID, which is randomly generated when you deploy your app. For example, if your endpoint ID is `fexh32emkg3az7`, the `/runsync` URL will be `https://api.runpod.ai/v2/fexh32emkg3az7/runsync`.
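+
+For example, assuming the `inference` endpoint above was assigned that ID, a synchronous request to `run_model` could look like this:
+
+```bash
+curl -X POST https://api.runpod.ai/v2/fexh32emkg3az7/runsync \
+  -H "Authorization: Bearer $RUNPOD_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"input": {"input": "example text"}}'
+```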
+
+#### Load-balancing example
+
+Load-balancing endpoints can have multiple routes on a single Serverless endpoint. Use the route decorator pattern:
+
+```python
+from runpod_flash import Endpoint
+
+# One endpoint can host multiple HTTP routes
+api = Endpoint(name="api-server", cpu="cpu5c-4-8", workers=(1, 5))
+
+@api.post("/generate")
+def generate_text(prompt: str): ...
+
+@api.get("/health")
+def health_check(): ...
+```
+
+This creates one Serverless endpoint with two routes, `POST /generate` and `GET /health`. Each route runs when you call the endpoint with the matching HTTP method and path.
+
+The final endpoint URL depends on your endpoint ID, which is randomly generated when you deploy your app, and the HTTP routes defined in your decorators. For example, if your endpoint ID is `l66m1rhm9dhbjd`, the `/generate` route will be available at `https://l66m1rhm9dhbjd.api.runpod.ai/generate`.
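+
+A call to that route could look like this, using the hypothetical endpoint ID above. Note that load-balanced routes take the JSON payload directly, without an `input` wrapper:
+
+```bash
+curl -X POST https://l66m1rhm9dhbjd.api.runpod.ai/generate \
+  -H "Authorization: Bearer $RUNPOD_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"prompt": "Hello world"}'
+```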
+
+[Learn more about endpoint mapping](/flash/apps/customize-app#understanding-endpoint-architecture).
+
+## Common workflows
+
+### Simple projects (single environment)
+
+For solo projects or simple applications:
+
+```bash
+# Initialize and develop
+flash init PROJECT_NAME
+cd PROJECT_NAME
+
+# Test locally
+flash run
+
+# Deploy to production (creates 'production' environment by default)
+flash deploy
+```
+
+### Team projects (multiple environments)
+
+For team collaboration with dev, staging, and production stages:
+
+```bash
+# Create environments
+flash env create dev
+flash env create staging
+flash env create production
+
+# Development cycle
+flash run # Test locally
+flash deploy --env dev # Deploy to dev for integration testing
+flash deploy --env staging # Deploy to staging for QA
+flash deploy --env production # Deploy to production after approval
+```
+
+### Feature development
+
+For testing new features in isolation:
+
+```bash
+# Create temporary feature environment
+flash env create FEATURE_NAME
+
+# Deploy and test
+flash deploy --env FEATURE_NAME
+
+# Clean up after merging
+flash env delete FEATURE_NAME
+```
+
+## Next steps
+
+
+
+ Create a Flash app, test it locally, and deploy it to production.
+
+
+ Create boilerplate code for a new Flash project with `flash init`.
+
+
+ Use `flash run` for local development and testing.
+
+
+ Deploy your application to production with `flash deploy`.
+
+
diff --git a/flash/apps/requests.mdx b/flash/apps/requests.mdx
new file mode 100644
index 00000000..39d893e9
--- /dev/null
+++ b/flash/apps/requests.mdx
@@ -0,0 +1,262 @@
+---
+title: "Send API requests to deployed endpoints"
+sidebarTitle: "Send API requests"
+description: "Call your deployed Flash endpoints using HTTP requests for queue-based and load-balanced configurations."
+---
+
+After deploying your Flash app with `flash deploy`, you can call your endpoints directly via HTTP. The request format depends on whether you're using queue-based or load-balanced configurations.
+
+## Authentication
+
+All deployed endpoints require authentication with your Runpod API key:
+
+```bash
+export RUNPOD_API_KEY="your_key_here"
+
+curl -X POST https://YOUR_ENDPOINT_URL/path \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"param": "value"}'
+```
+
+
+Your endpoint URLs are displayed after running `flash deploy`. You can also view them with `flash env get ENVIRONMENT_NAME`.
+
+
+## Queue-based endpoints
+
+Queue-based endpoints (using the `@Endpoint(name=..., gpu=...)` decorator) provide two routes for job submission: `/run` (asynchronous) and `/runsync` (synchronous).
+
+### Asynchronous calls (`/run`)
+
+Submit a job and receive a job ID for later status checking:
+
+```bash
+curl -X POST https://api.runpod.ai/v2/abc123xyz/run \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"input": {"prompt": "Hello world"}}'
+```
+
+**Response:**
+```json
+{
+ "id": "job-abc-123",
+ "status": "IN_QUEUE"
+}
+```
+
+**Check status later:**
+```bash
+curl https://api.runpod.ai/v2/abc123xyz/status/job-abc-123 \
+ -H "Authorization: Bearer $RUNPOD_API_KEY"
+```
+
+**When job completes:**
+```json
+{
+ "id": "job-abc-123",
+ "status": "COMPLETED",
+ "output": {
+ "generated_text": "Hello world from GPU!"
+ }
+}
+```
+
+### Synchronous calls (`/runsync`)
+
+Wait for job completion and receive results directly (with timeout):
+
+```bash
+curl -X POST https://api.runpod.ai/v2/abc123xyz/runsync \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"input": {"prompt": "Hello world"}}'
+```
+
+**Response (after job completes):**
+```json
+{
+ "id": "job-abc-123",
+ "status": "COMPLETED",
+ "output": {
+ "generated_text": "Hello world from GPU!"
+ }
+}
+```
+
+
+Use `/run` for long-running jobs that you'll check later. Use `/runsync` for quick jobs where you want immediate results (with timeout protection).
+
+
+### Queue-based request format
+
+Queue-based endpoints expect input wrapped in an `{"input": {...}}` object:
+
+```bash
+curl -X POST https://api.runpod.ai/v2/abc123xyz/runsync \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "input": {
+ "param1": "value1",
+ "param2": "value2"
+ }
+ }'
+```
+
+The structure inside `"input"` depends on your `@Endpoint` function signature.
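+
+As a sketch, if your function were defined as `def generate_image(prompt: str, steps: int = 20)` (a hypothetical signature), the request would carry those parameters inside `input`:
+
+```bash
+curl -X POST https://api.runpod.ai/v2/abc123xyz/runsync \
+  -H "Authorization: Bearer $RUNPOD_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{"input": {"prompt": "A red balloon", "steps": 20}}'
+```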
+
+### Job status states
+
+| Status | Description |
+|--------|-------------|
+| `IN_QUEUE` | Waiting for an available worker |
+| `IN_PROGRESS` | Worker is executing your function |
+| `COMPLETED` | Function finished successfully |
+| `FAILED` | Execution encountered an error |
+
+## Load-balanced endpoints
+
+Load-balanced endpoints (using the `api = Endpoint(...); @api.post("/path")` pattern) provide custom HTTP routes with direct request/response patterns.
+
+### Calling load-balanced routes
+
+All routes share the same base URL. Append the route path to call specific functions:
+
+```bash
+# POST route
+curl -X POST https://abc123xyz.api.runpod.ai/analyze \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"text": "Hello world from Flash"}'
+
+# GET route
+curl -X GET https://abc123xyz.api.runpod.ai/info \
+ -H "Authorization: Bearer $RUNPOD_API_KEY"
+
+# Another POST route (same endpoint URL)
+curl -X POST https://abc123xyz.api.runpod.ai/validate \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"name": "Alice", "email": "alice@example.com"}'
+```
+
+### Load-balanced request format
+
+Load-balanced endpoints accept direct JSON payloads (no `{"input": {...}}` wrapper):
+
+```bash
+curl -X POST https://abc123xyz.api.runpod.ai/process \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "param1": "value1",
+ "param2": "value2"
+ }'
+```
+
+The payload structure depends on your function signature. Each route can accept different parameters.
+
+### Multiple routes, single endpoint
+
+A single load-balanced endpoint can serve multiple routes:
+
+```python
+from runpod_flash import Endpoint
+
+api = Endpoint(name="api-server", cpu="cpu5c-4-8", workers=(1, 5))
+
+# All these routes share one endpoint URL
+@api.post("/generate")
+async def generate_text(prompt: str): ...
+
+@api.post("/translate")
+async def translate_text(text: str): ...
+
+@api.get("/health")
+async def health_check(): ...
+```
+
+```bash
+# All use the same base URL with different paths
+curl -X POST https://abc123xyz.api.runpod.ai/generate -H "..." -d '{...}'
+curl -X POST https://abc123xyz.api.runpod.ai/translate -H "..." -d '{...}'
+curl -X GET https://abc123xyz.api.runpod.ai/health -H "..."
+```
+
+## Quick reference
+
+| Endpoint type | Routes | Request format | Response |
+|--------------|--------|----------------|----------|
+| Queue-based | `/run`, `/runsync`, `/status/{id}` | `{"input": {...}}` | Job ID (async) or result (sync) |
+| Load-balanced | Custom paths (e.g., `/process`) | Direct JSON payload | Direct response |
+
+## Response status codes
+
+| Code | Meaning |
+|------|---------|
+| `200` | Success (load-balanced) or job accepted (queue-based) |
+| `400` | Bad request (invalid input format) |
+| `401` | Unauthorized (invalid or missing API key) |
+| `404` | Route not found |
+| `500` | Internal server error |
+
+## Error handling
+
+Queue-based errors appear in the job output:
+
+```json
+{
+ "id": "job-abc-123",
+ "status": "FAILED",
+ "error": "Error message from your function"
+}
+```
+
+Load-balanced errors return HTTP error codes with JSON body:
+
+```json
+{
+ "error": "Error message from your function",
+ "detail": "Additional error context"
+}
+```
+
+## Using SDKs
+
+For programmatic access, use the Runpod Python SDK:
+
+```python
+import runpod
+
+# Set API key
+runpod.api_key = "your_api_key"
+
+# Connect to endpoint
+endpoint = runpod.Endpoint("YOUR_ENDPOINT_ID")
+
+# Async call (returns job object immediately)
+run_request = endpoint.run({"prompt": "Hello world"})
+status = run_request.status() # Check status
+output = run_request.output() # Get result once complete
+
+# Sync call (blocks until complete)
+result = endpoint.run_sync({"prompt": "Hello world"})
+```
+
+See the [Runpod SDK documentation](/sdks/python/endpoints) for complete SDK usage.
+
+## Next steps
+
+
+
+ Deploy your Flash app to get endpoint URLs.
+
+
+ View all endpoint configuration parameters.
+
+
+ Use the Python SDK for programmatic access.
+
+
diff --git a/flash/cli/app.mdx b/flash/cli/app.mdx
new file mode 100644
index 00000000..50912d2b
--- /dev/null
+++ b/flash/cli/app.mdx
@@ -0,0 +1,181 @@
+---
+title: "app"
+sidebarTitle: "app"
+---
+
+Manage Flash applications. An app is the top-level container that groups your deployment environments, build artifacts, and configuration.
+
+```bash Command
+flash app SUBCOMMAND [OPTIONS]
+```
+
+## Subcommands
+
+| Subcommand | Description |
+|------------|-------------|
+| `list` | Show all apps in your account |
+| `create` | Create a new app |
+| `get` | Show details of an app |
+| `delete` | Delete an app and all its resources |
+
+---
+
+## app list
+
+Show all Flash apps under your account.
+
+```bash Command
+flash app list
+```
+
+### Output
+
+```text
+┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
+┃ Name ┃ ID ┃ Environments ┃ Builds ┃
+┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
+│ my-project │ app_abc123 │ dev, staging, prod │ build_1, build_2 │
+│ demo-api │ app_def456 │ production │ build_3 │
+│ ml-inference │ app_ghi789 │ dev, production │ build_4, build_5 │
+└────────────────┴──────────────────────┴─────────────────────────┴──────────────────┘
+```
+
+---
+
+## app create
+
+Register a new Flash app on Runpod's backend.
+
+```bash Command
+flash app create APP_NAME
+```
+
+### Arguments
+
+
+Name for the new Flash app. Must be unique within your account.
+
+
+### What it creates
+
+This command registers a Flash app in Runpod's backend—essentially creating a namespace for your environments and builds. It does not:
+
+- Create local files (use `flash init` for that).
+- Provision cloud resources (endpoints, volumes, etc.).
+- Deploy any code.
+
+The app is just a container that groups environments and builds together.
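+
+For example, you could register an app and then confirm it appears in your account (the app name here is a placeholder):
+
+```bash
+# Register the app, then verify it was created
+flash app create my-project
+flash app list
+```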
+
+### When to use
+
+
+
+Most users don't need to run `flash app create` explicitly. Apps are created automatically when you first run `flash deploy`. This command is primarily for CI/CD pipelines that need to pre-register apps before deployment.
+
+
+
+---
+
+## app get
+
+Get detailed information about a Flash app.
+
+```bash Command
+flash app get APP_NAME
+```
+
+### Arguments
+
+
+Name of the Flash app to inspect.
+
+
+### Output
+
+```text
+╭─────────────────────────────────╮
+│ Flash App: my-project │
+├─────────────────────────────────┤
+│ Name: my-project │
+│ ID: app_abc123 │
+│ Environments: 3 │
+│ Builds: 5 │
+╰─────────────────────────────────╯
+
+ Environments
+┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
+┃ Name ┃ ID ┃ State ┃ Active Build ┃ Created ┃
+┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
+│ dev │ env_dev123 │ DEPLOYED│ build_xyz789 │ 2024-01-15 10:30 │
+│ staging │ env_stg456 │ DEPLOYED│ build_xyz789 │ 2024-01-16 14:20 │
+│ production │ env_prd789 │ DEPLOYED│ build_abc123 │ 2024-01-20 09:15 │
+└────────────┴────────────────────┴─────────┴──────────────────┴──────────────────┘
+
+ Builds
+┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
+┃ ID ┃ Status ┃ Created ┃
+┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
+│ build_abc123 │ COMPLETED │ 2024-01-20 09:00 │
+│ build_xyz789 │ COMPLETED │ 2024-01-18 15:45 │
+│ build_def456 │ COMPLETED │ 2024-01-15 11:20 │
+└────────────────────┴──────────────────────────┴──────────────────┘
+```
+
+---
+
+## app delete
+
+Delete a Flash app and all its associated resources.
+
+```bash Command
+flash app delete APP_NAME
+```
+
+### Arguments
+
+
+Name of the Flash app to delete.
+
+
+### Process
+
+1. Shows app details and resources to be deleted.
+2. Prompts for confirmation (required).
+3. Deletes all environments and their resources.
+4. Deletes all builds.
+5. Deletes the app.
+
+
+
+This operation is irreversible. All environments, builds, endpoints, volumes, and configuration will be permanently deleted.
+
+
+
+---
+
+## App hierarchy
+
+See [Apps and environments](/flash/apps/apps-and-environments#app-hierarchy) for the complete app organization structure.
+
+## Auto-detection
+
+Flash CLI automatically detects the app name from your current directory:
+
+```bash
+cd /path/to/APP_NAME
+flash deploy # Deploys to 'APP_NAME' app
+flash env list # Lists 'APP_NAME' environments
+```
+
+Override with the `--app` flag:
+
+```bash
+flash deploy --app other-project
+flash env list --app other-project
+```
+
+## Related commands
+
+- [`flash env`](/flash/cli/env) - Manage environments within an app
+- [`flash deploy`](/flash/cli/deploy) - Deploy to an app's environment
+- [`flash init`](/flash/cli/init) - Create a new project
diff --git a/flash/cli/build.mdx b/flash/cli/build.mdx
new file mode 100644
index 00000000..cf718e53
--- /dev/null
+++ b/flash/cli/build.mdx
@@ -0,0 +1,160 @@
+---
+title: "build"
+sidebarTitle: "build"
+---
+
+Build a deployment-ready artifact for your Flash application without deploying. Use this for more control over the build process or to inspect the artifact before deploying.
+
+```bash
+flash build [OPTIONS]
+```
+
+## Examples
+
+Build with all dependencies:
+
+```bash
+flash build
+```
+
+Build with additional excluded packages:
+
+```bash
+flash build --exclude scipy,pandas
+```
+
+Build with custom output name:
+
+```bash
+flash build -o my-app.tar.gz
+```
+
+## Flags
+
+
+Skip transitive dependencies during pip install. Only installs direct dependencies specified in `@Endpoint` decorators. Useful when the base image already includes dependencies.
+
+
+
+Custom name for the output archive file.
+
+
+
+Comma-separated list of packages to exclude from the build (e.g., `torch,torchvision`). Use this to skip packages already in the base image.
+
+
+## What happens during build
+
+1. **Function discovery**: Finds all `@Endpoint` decorated functions.
+2. **Grouping**: Groups functions by their endpoint configuration.
+3. **Manifest generation**: Creates `.flash/flash_manifest.json` with endpoint definitions.
+4. **Dependency installation**: Installs Python packages for Linux x86_64.
+5. **Packaging**: Bundles everything into `.flash/artifact.tar.gz`.
+
+## Build artifacts
+
+After running `flash build`:
+
+| File/Directory | Description |
+|----------------|-------------|
+| `.flash/artifact.tar.gz` | Deployment package ready for Runpod |
+| `.flash/flash_manifest.json` | Service discovery configuration |
+| `.flash/.build/` | Build directory (kept for inspection) |
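+
+You can inspect these outputs locally with standard shell tools, for example:
+
+```bash
+# List the generated build files
+ls .flash/
+
+# Review the endpoint definitions Flash generated
+cat .flash/flash_manifest.json
+
+# Peek at the contents of the deployment archive
+tar -tzf .flash/artifact.tar.gz | head
+```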
+
+## Cross-platform builds
+
+Flash automatically handles cross-platform builds:
+
+- **Automatic platform targeting**: Dependencies are installed for Linux x86_64, regardless of your build platform.
+- **Binary wheel enforcement**: Only pre-built wheels are used, preventing compilation issues.
+
+### Python version in deployed workers
+
+Your local Python version does not affect what runs in the cloud. `flash build` downloads wheels for the container's Python version automatically.
+
+| Worker type | Python version | Notes |
+|-------------|----------------|-------|
+| GPU | 3.12 only | The GPU base image includes multiple interpreters (3.9–3.14) for interactive pod use, but torch and CUDA libraries are installed only for 3.12. |
+| CPU | 3.10, 3.11, or 3.12 | Configurable via the `PYTHON_VERSION` build arg. |
+
+Image tags follow the pattern `py{version}-{tag}` (for example, `runpod/flash:py3.12-latest`).
+
+## Managing deployment size
+
+Runpod Serverless has a **500 MB deployment limit**. Flash automatically excludes packages that are pre-installed in the base image:
+
+- `torch`, `torchvision`, `torchaudio`
+- `numpy`, `triton`
+
+These packages are excluded at archive time, so you don't need to specify them manually.
+
+### Manual exclusions
+
+Use `--exclude` to skip additional packages that are already in a custom base image or not needed:
+
+```bash
+flash build --exclude scipy,pandas
+```
+
+### Base image reference
+
+| Resource type | Base image | Auto-excluded packages |
+|--------------|------------|------------------------|
+| GPU | PyTorch base | `torch`, `torchvision`, `torchaudio`, `numpy`, `triton` |
+| CPU | Python slim | `torch`, `torchvision`, `torchaudio`, `numpy`, `triton` |
+
+
+
+Check the [worker-flash repository](https://github.com/runpod-workers/worker-flash) for current base images and pre-installed packages.
+
+
+
+## Troubleshooting
+
+### Build fails with "functions not found"
+
+Ensure your project has `@Endpoint` decorated functions:
+
+```python
+from runpod_flash import Endpoint, GpuGroup
+
+@Endpoint(name="my-worker", gpu=GpuGroup.ANY)
+def my_function(data):
+ return {"result": data}
+```
+
+### Archive is too large
+
+Base image packages (`torch`, `numpy`, `triton`, etc.) are auto-excluded. If the archive is still too large, use `--exclude` to skip additional packages or `--no-deps` to skip transitive dependencies:
+
+```bash
+flash build --exclude scipy,pandas
+```
+
+### Dependency installation fails
+
+If a package doesn't have Linux x86_64 wheels:
+
+1. Ensure standard pip is installed: `python -m ensurepip --upgrade`
+2. Check PyPI for Linux wheel availability.
+3. For Python 3.13+, some packages may require newer manylinux versions.
+
+### Need to examine generated files
+
+The build directory is kept after building. Inspect it with:
+
+```bash
+ls .flash/.build/
+```
+
+## Related commands
+
+- [`flash deploy`](/flash/cli/deploy) - Build and deploy in one step (includes `--preview` option for local testing)
+- [`flash run`](/flash/cli/run) - Start development server
+- [`flash env`](/flash/cli/env) - Manage environments
+
+
+
+Most users should use `flash deploy` instead, which runs build and deploy in one step. Use `flash build` when you need more control or want to inspect the artifact.
+
+
diff --git a/flash/cli/deploy.mdx b/flash/cli/deploy.mdx
new file mode 100644
index 00000000..233d8a10
--- /dev/null
+++ b/flash/cli/deploy.mdx
@@ -0,0 +1,255 @@
+---
+title: "deploy"
+sidebarTitle: "deploy"
+---
+
+Build and deploy your Flash application to Runpod Serverless endpoints in one step. This is the primary command for getting your application running in the cloud.
+
+```bash
+flash deploy [OPTIONS]
+```
+
+## Examples
+
+Build and deploy a Flash app from the current directory (auto-selects environment if only one exists):
+
+```bash
+flash deploy
+```
+
+Deploy to a specific environment:
+
+```bash
+flash deploy --env production
+```
+
+Deploy with additional excluded packages:
+
+```bash
+flash deploy --exclude scipy,pandas
+```
+
+Build and test locally before deploying:
+
+```bash
+flash deploy --preview
+```
+
+## Flags
+
+
+Target environment name (e.g., `dev`, `staging`, `production`). Auto-selected if only one exists. Creates the environment if it doesn't exist.
+
+
+
+Flash app name. Auto-detected from the current directory if not specified.
+
+
+
+Skip transitive dependencies during pip install. Useful when the base image already includes dependencies.
+
+
+
+Comma-separated packages to exclude (e.g., `torch,torchvision`). Use this to stay under the 500 MB deployment limit.
+
+
+
+Custom archive name for the build artifact.
+
+
+
+Build and launch a local Docker-based preview environment instead of deploying to Runpod.
+
+
+## What happens during deployment
+
+1. **Build phase**: Creates the deployment artifact (same as `flash build`).
+2. **Environment resolution**: Detects or creates the target environment.
+3. **Upload**: Sends the artifact to Runpod storage.
+4. **Provisioning**: Creates or updates Serverless endpoints.
+5. **Configuration**: Sets up environment variables and service discovery.
+
+## Architecture
+
+After deployment, your Flash app runs as independent Serverless endpoints on Runpod:
+
+
+
+Each endpoint configuration in your code creates an independent Serverless endpoint. You can call any endpoint directly based on your needs.
+
+## Environment management
+
+### Automatic creation
+
+If the specified environment doesn't exist, `flash deploy` creates it:
+
+```bash
+# Creates 'staging' if it doesn't exist
+flash deploy --env staging
+```
+
+### Auto-selection
+
+When you have only one environment, it's selected automatically:
+
+```bash
+# Auto-selects the only available environment
+flash deploy
+```
+
+When multiple environments exist, you must specify one:
+
+```bash
+# Required when multiple environments exist
+flash deploy --env staging
+```
+
+### Default environment
+
+If no environment exists and none is specified, Flash creates a `production` environment by default.
+
+## Post-deployment
+
+After successful deployment, Flash displays all deployed endpoints:
+
+```text
+✓ Deployment Complete
+
+Load-balanced endpoints:
+ https://abc123xyz.api.runpod.ai (lb_worker)
+ POST /process
+ GET /health
+
+ Try it:
+ curl -X POST https://abc123xyz.api.runpod.ai/process \
+ -H "Content-Type: application/json" \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"input": {}}'
+
+Queue-based endpoints:
+ https://api.runpod.ai/v2/def456xyz (gpu_worker)
+ https://api.runpod.ai/v2/ghi789xyz (cpu_worker)
+
+ Try it:
+ curl -X POST https://api.runpod.ai/v2/def456xyz/runsync \
+ -H "Content-Type: application/json" \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -d '{"input": {}}'
+```
+
+Each endpoint is independent with its own URL and can be called directly.
+
+### Authentication
+
+All deployed endpoints require authentication with your Runpod API key:
+
+```bash
+export RUNPOD_API_KEY="your_key_here"
+
+curl -X POST https://YOUR_ENDPOINT_URL/path \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"param": "value"}'
+```
+
+## Preview mode
+
+Test locally before deploying:
+
+```bash
+flash deploy --preview
+```
+
+This builds your project and runs it in Docker containers locally:
+
+- Each endpoint runs in its own container.
+- All containers communicate over a Docker network.
+- Endpoints are exposed on local ports for testing.
+- Press `Ctrl+C` to stop.
+
+## Managing deployment size
+
+Runpod Serverless has a **500 MB limit**. Flash automatically excludes packages that are pre-installed in the base image (`torch`, `torchvision`, `torchaudio`, `numpy`, `triton`).
+
+If the deployment is still too large, use `--exclude` to skip additional packages:
+
+```bash
+flash deploy --exclude scipy,pandas
+```
+
+See [`flash build` - Managing deployment size](/flash/cli/build#managing-deployment-size) for more details.
+
+## flash run vs flash deploy
+
+See [`flash run`](/flash/cli/run#flash-run-vs-flash-deploy) for a detailed comparison of local development vs production deployment.
+
+## Troubleshooting
+
+### Multiple environments error
+
+```text
+Error: Multiple environments found: dev, staging, production
+```
+
+Specify the target environment:
+
+```bash
+flash deploy --env staging
+```
+
+### Deployment size limit
+
+Base image packages are auto-excluded. If the deployment is still too large, use `--exclude` to skip additional packages:
+
+```bash
+flash deploy --exclude scipy,pandas
+```
+
+### Authentication fails
+
+Ensure your API key is set:
+
+```bash
+echo $RUNPOD_API_KEY
+export RUNPOD_API_KEY="your_key_here"
+```
+
+## Related commands
+
+- [`flash build`](/flash/cli/build) - Build without deploying
+- [`flash run`](/flash/cli/run) - Local development server
+- [`flash env`](/flash/cli/env) - Manage environments
+- [`flash app`](/flash/cli/app) - Manage applications
+- [`flash undeploy`](/flash/cli/undeploy) - Remove endpoints
diff --git a/flash/cli/env.mdx b/flash/cli/env.mdx
new file mode 100644
index 00000000..7d4494ba
--- /dev/null
+++ b/flash/cli/env.mdx
@@ -0,0 +1,255 @@
+---
+title: "env"
+sidebarTitle: "env"
+---
+
+Manage deployment environments for Flash applications. Environments are isolated deployment contexts (like `dev`, `staging`, `production`) within a Flash app.
+
+```bash Command
+flash env SUBCOMMAND [OPTIONS]
+```
+
+## Subcommands
+
+| Subcommand | Description |
+|------------|-------------|
+| `list` | Show all environments for an app |
+| `create` | Create a new environment |
+| `get` | Show details of an environment |
+| `delete` | Delete an environment and its resources |
+
+---
+
+## env list
+
+Show all available environments for an app.
+
+```bash Command
+flash env list [OPTIONS]
+```
+
+### Example
+
+```bash
+# List environments for current app
+flash env list
+
+# List environments for specific app
+flash env list --app APP_NAME
+```
+
+### Flags
+
+
+Flash app name. Auto-detected from current directory if not specified.
+
+
+### Output
+
+```text
+┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
+┃ Name ┃ ID ┃ Active Build ┃ Created At ┃
+┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
+│ dev │ env_abc123 │ build_xyz789 │ 2024-01-15 10:30 │
+│ staging │ env_def456 │ build_uvw456 │ 2024-01-16 14:20 │
+│ production │ env_ghi789 │ build_rst123 │ 2024-01-20 09:15 │
+└────────────┴─────────────────────┴───────────────────┴──────────────────┘
+```
+
+---
+
+## env create
+
+Create a new deployment environment.
+
+```bash Command
+flash env create ENVIRONMENT_NAME [OPTIONS]
+```
+
+### Example
+
+```bash
+# Create staging environment
+flash env create staging
+
+# Create environment in specific app
+flash env create production --app APP_NAME
+```
+
+### Arguments
+
+
+Name for the new environment (e.g., `dev`, `staging`, `production`).
+
+
+### Flags
+
+
+Flash app name. Auto-detected from current directory if not specified.
+
+
+### Notes
+
+- If the app doesn't exist, it's created automatically.
+- Environment names must be unique within an app.
+- Newly created environments have no active build until first deployment.
+
+
+
+You don't always need to create environments explicitly. Running `flash deploy --env ENVIRONMENT_NAME` creates the environment automatically if it doesn't exist.
+
+
+
+---
+
+## env get
+
+Show detailed information about a deployment environment.
+
+```bash Command
+flash env get ENVIRONMENT_NAME [OPTIONS]
+```
+
+### Example
+
+```bash
+# Get details for production environment
+flash env get production
+
+# Get details for specific app's environment
+flash env get staging --app APP_NAME
+```
+
+### Arguments
+
+
+Name of the environment to inspect.
+
+
+### Flags
+
+
+Flash app name. Auto-detected from current directory if not specified.
+
+
+### Output
+
+```text
+╭────────────────────────────────────╮
+│ Environment: production │
+├────────────────────────────────────┤
+│ ID: env_ghi789 │
+│ State: DEPLOYED │
+│ Active Build: build_rst123 │
+│ Created: 2024-01-20 09:15:00 │
+╰────────────────────────────────────╯
+
+ Associated Endpoints
+┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
+┃ Name ┃ ID ┃
+┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
+│ my-gpu │ ep_abc123 │
+│ my-cpu │ ep_def456 │
+└────────────────┴────────────────────┘
+
+ Associated Network Volumes
+┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
+┃ Name ┃ ID ┃
+┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
+│ model-cache │ nv_xyz789 │
+└────────────────┴────────────────────┘
+```
+
+---
+
+## env delete
+
+Delete a deployment environment and all its associated resources.
+
+```bash Command
+flash env delete ENVIRONMENT_NAME [OPTIONS]
+```
+
+### Examples
+
+```bash
+# Delete development environment
+flash env delete dev
+
+# Delete environment in specific app
+flash env delete staging --app APP_NAME
+```
+
+### Arguments
+
+
+Name of the environment to delete.
+
+
+### Flags
+
+
+Flash app name. Auto-detected from current directory if not specified.
+
+
+### Process
+
+1. Shows environment details and resources to be deleted.
+2. Prompts for confirmation (required).
+3. Undeploys all associated endpoints.
+4. Removes all associated network volumes.
+5. Deletes the environment from the app.
+
+
+
+This operation is irreversible. All endpoints, volumes, and configuration associated with the environment will be permanently deleted.
+
+
+
+---
+
+## Environment states
+
+| State | Description |
+|-------|-------------|
+| PENDING | Environment created but not deployed |
+| DEPLOYING | Deployment in progress |
+| DEPLOYED | Successfully deployed and running |
+| FAILED | Deployment or health check failed |
+| DELETING | Deletion in progress |
+
+## Common workflows
+
+### Three-tier deployment
+
+```bash
+# Create environments
+flash env create dev
+flash env create staging
+flash env create production
+
+# Deploy to each
+flash deploy --env dev
+flash deploy --env staging
+flash deploy --env production
+```
+
+### Feature branch testing
+
+```bash
+# Create feature environment
+flash env create FEATURE_NAME
+
+# Deploy feature branch
+git checkout FEATURE_NAME
+flash deploy --env FEATURE_NAME
+
+# Clean up after merge
+flash env delete FEATURE_NAME
+```
+
+## Related commands
+
+- [`flash deploy`](/flash/cli/deploy) - Deploy to an environment
+- [`flash app`](/flash/cli/app) - Manage applications
+- [`flash undeploy`](/flash/cli/undeploy) - Remove specific endpoints
diff --git a/flash/cli/init.mdx b/flash/cli/init.mdx
new file mode 100644
index 00000000..289fbcb1
--- /dev/null
+++ b/flash/cli/init.mdx
@@ -0,0 +1,86 @@
+---
+title: "init"
+sidebarTitle: "init"
+---
+
+Create a new Flash project with a ready-to-use template structure including a FastAPI server, example GPU and CPU workers, and configuration files.
+
+```bash
+flash init [PROJECT_NAME] [OPTIONS]
+```
+
+## Example
+
+Create a new project directory:
+
+```bash
+flash init PROJECT_NAME
+cd PROJECT_NAME
+pip install -r requirements.txt
+flash run
+```
+
+Initialize in the current directory:
+
+```bash
+flash init .
+```
+
+## Arguments
+
+
+Name of the project directory to create. If omitted or set to `.`, initializes in the current directory.
+
+
+## Flags
+
+
+Overwrite existing files if they already exist in the target directory.
+
+
+## What it creates
+
+The command creates the following project structure:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+### Template contents
+
+- **lb_worker.py**: Load-balanced endpoint with HTTP routes. Contains `@Endpoint` functions with custom HTTP methods and paths (e.g., `POST /process`, `GET /health`). Multiple routes can share the same endpoint.
+- **gpu_worker.py**: GPU queue-based endpoint. Contains an `@Endpoint` function that runs on GPU hardware. Provides `/run` or `/runsync` routes for job submission. Creates one Serverless endpoint when deployed.
+- **cpu_worker.py**: CPU queue-based endpoint. Contains an `@Endpoint` function that runs on CPU-only instances. Provides `/run` or `/runsync` routes for job submission. Creates one Serverless endpoint when deployed.
+- **.env**: Template for environment variables including `RUNPOD_API_KEY`.
+
+## Next steps
+
+After initialization:
+
+1. Copy `.env.example` to `.env` (if needed) and add your `RUNPOD_API_KEY`. Steps 1 through 3 are shown together in the example after this list.
+2. Install dependencies: `pip install -r requirements.txt`
+3. Start the development server: `flash run`
+4. Open http://localhost:8888/docs to explore the API.
+5. Customize the workers for your use case.
+6. Deploy with `flash deploy` when ready.
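+Taken together, the first three steps look something like this, assuming the template includes a `.env.example` file:
+
+```bash
+cp .env.example .env             # then add your RUNPOD_API_KEY to .env
+pip install -r requirements.txt
+flash run
+```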
+
+
+
+This command only creates local files. It doesn't interact with Runpod or create any cloud resources. Cloud resources are created when you run `flash run` or `flash deploy`.
+
+
+
+## Related commands
+
+- [`flash run`](/flash/cli/run) - Start the development server
+- [`flash deploy`](/flash/cli/deploy) - Build and deploy to Runpod
diff --git a/flash/cli/login.mdx b/flash/cli/login.mdx
new file mode 100644
index 00000000..7d364299
--- /dev/null
+++ b/flash/cli/login.mdx
@@ -0,0 +1,77 @@
+---
+title: "login"
+sidebarTitle: "login"
+---
+
+Authenticate with Runpod and save your API key for all Flash operations, including CLI commands and standalone `@Endpoint` functions.
+
+```bash
+flash login [OPTIONS]
+```
+
+## Example
+
+Authenticate with Runpod (opens browser automatically):
+
+```bash
+flash login
+```
+
+The command opens your default browser to the Runpod authorization page. After you approve the request, your API key is saved locally for future CLI operations.
+
+## How it works
+
+1. Flash generates an authorization request.
+2. Your browser opens to the Runpod console authorization page.
+3. You approve the request in your browser.
+4. Flash saves your API key to `~/.runpod/credentials.toml`.
+
+## Flags
+
+
+Don't automatically open the browser. Instead, manually copy the authorization URL and open it yourself.
+
+
+
+Maximum time in seconds to wait for authorization. Default is 600 seconds (10 minutes).
+
+
+## Credential storage
+
+After successful login, your API key is saved to `~/.runpod/credentials.toml`. This file is used by:
+
+- All Flash CLI commands (`flash run`, `flash deploy`, etc.)
+- Standalone Python scripts using `@Endpoint` functions
+- Any code using the Flash SDK
+
+
+Keep your API key secure. Never commit it to version control. The credentials file is stored in your home directory, outside of project directories.
+
+
+
+## Alternative: Environment variable authentication
+
+Instead of using `flash login`, you can set your API key directly as an environment variable:
+
+```bash
+export RUNPOD_API_KEY=your_api_key_here
+```
+
+Or add it to your project's `.env` file:
+
+```bash
+RUNPOD_API_KEY=your_api_key_here
+```
+
+You can generate an API key with the correct permissions from [Settings > API Keys](https://www.runpod.io/console/user/settings) in the Runpod console.
+
+
+Your Runpod API key needs **All** access permissions to your Runpod account.
+
+
+
+## Related commands
+
+- [`flash init`](/flash/cli/init) - Create a new Flash project
+- [`flash run`](/flash/cli/run) - Start the development server
+- [`flash deploy`](/flash/cli/deploy) - Build and deploy to Runpod
diff --git a/flash/cli/overview.mdx b/flash/cli/overview.mdx
new file mode 100644
index 00000000..b3aaa290
--- /dev/null
+++ b/flash/cli/overview.mdx
@@ -0,0 +1,101 @@
+---
+title: "CLI overview"
+sidebarTitle: "Overview"
+description: "Learn how to use the Flash CLI for local development and deployment."
+---
+
+The Flash CLI provides commands for initializing projects, running local development servers, building deployment artifacts, and managing your applications on Runpod Serverless.
+
+Before using the CLI, make sure you've [installed Flash](/flash/overview#install-flash) and set your [Runpod API key](/get-started/api-keys) in your environment.
+
+## Available commands
+
+| Command | Description |
+|---------|-------------|
+| [`flash init`](/flash/cli/init) | Create a new Flash project with a template structure |
+| [`flash login`](/flash/cli/login) | Authenticate with Runpod using your API key |
+| [`flash run`](/flash/cli/run) | Start the local development server with automatic updates |
+| [`flash build`](/flash/cli/build) | Build a deployment artifact without deploying |
+| [`flash deploy`](/flash/cli/deploy) | Build and deploy your application to Runpod |
+| [`flash env`](/flash/cli/env) | Manage deployment environments |
+| [`flash app`](/flash/cli/app) | Manage Flash applications |
+| [`flash undeploy`](/flash/cli/undeploy) | Remove deployed endpoints |
+
+## Getting help
+
+View help for any command by adding `--help`:
+
+```bash
+flash --help
+flash deploy --help
+flash env --help
+```
+
+## Authentication
+
+Authenticate with Runpod using your API key:
+
+```bash
+flash login
+```
+
+This command opens your browser to authorize Flash with your Runpod account and stores your API key securely for future CLI operations. You can find your API key in the [Runpod console](https://www.runpod.io/console/user/settings).
+
+Alternatively, set the `RUNPOD_API_KEY` environment variable:
+
+```bash
+export RUNPOD_API_KEY=your_api_key_here
+```
+
+## Common workflows
+
+### Local development
+
+```bash
+# Create a new project
+flash init PROJECT_NAME
+cd PROJECT_NAME
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Add your API key to .env
+# Start the development server
+flash run
+```
+
+### Deploy to production
+
+```bash
+# Build and deploy
+flash deploy
+
+# Deploy to a specific environment
+flash deploy --env ENVIRONMENT_NAME
+```
+
+### Manage deployments
+
+```bash
+# List environments
+flash env list
+
+# Check environment status
+flash env get ENVIRONMENT_NAME
+
+# Remove an environment
+flash env delete ENVIRONMENT_NAME
+```
+
+### Clean up endpoints
+
+```bash
+# List deployed endpoints
+flash undeploy list
+
+# Remove specific endpoint
+flash undeploy ENDPOINT_NAME
+
+# Remove all endpoints
+flash undeploy --all
+```
\ No newline at end of file
diff --git a/flash/cli/run.mdx b/flash/cli/run.mdx
new file mode 100644
index 00000000..45b23380
--- /dev/null
+++ b/flash/cli/run.mdx
@@ -0,0 +1,168 @@
+---
+title: "run"
+sidebarTitle: "run"
+---
+
+Start the Flash development server for local testing with automatic updates. A local development server provides a unified interface for testing while `@Endpoint` functions execute on Runpod Serverless.
+
+```bash
+flash run [OPTIONS]
+```
+
+## Example
+
+Start the development server with defaults:
+
+```bash
+flash run
+```
+
+Start with auto-provisioning to eliminate cold-start delays:
+
+```bash
+flash run --auto-provision
+```
+
+Start on a custom port:
+
+```bash
+flash run --port 3000
+```
+
+## Flags
+
+
+Host address to bind the server to.
+
+
+
+Port number to bind the server to.
+
+
+
+Enable or disable auto-reload on code changes. Enabled by default.
+
+
+
+Auto-provision all Serverless endpoints on startup instead of lazily on first call. Eliminates cold-start delays during development.
+
+
+## Architecture
+
+With `flash run`, Flash starts a local development server alongside remote Serverless endpoints:
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ Browser(["BROWSER/CURL"])
+
+ subgraph Local ["YOUR MACHINE (localhost:8888)"]
+ DevServer["Development Server • Auto-reload on changes • API explorer at /docs • Routes requests"]
+ end
+
+ subgraph Runpod ["RUNPOD SERVERLESS"]
+ LB["live-lb_worker"]
+ GPU["live-gpu_worker"]
+ CPU["live-cpu_worker"]
+ end
+
+ Browser -->|"HTTP"| DevServer
+ DevServer -->|"HTTPS"| LB
+ DevServer -->|"HTTPS"| GPU
+ DevServer -->|"HTTPS"| CPU
+
+ style Local fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Runpod fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Browser fill:#4D38F5,stroke:#4D38F5,color:#fff
+ style DevServer fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style LB fill:#22C55E,stroke:#22C55E,color:#000
+ style GPU fill:#22C55E,stroke:#22C55E,color:#000
+ style CPU fill:#22C55E,stroke:#22C55E,color:#000
+```
+
+**Key points:**
+
+- A local development server provides a convenient testing interface at `localhost:8888`.
+- `@Endpoint` functions deploy to Runpod Serverless with `live-` prefix to distinguish from production.
+- Code changes are picked up automatically without restarting the server.
+- The development server routes requests to appropriate remote endpoints.
+
+This differs from `flash deploy`, where all endpoints run on Runpod without a local server.
+
+## Auto-provisioning
+
+By default, endpoints are provisioned lazily on first `@Endpoint` function call. Use `--auto-provision` to provision all endpoints at server startup:
+
+```bash
+flash run --auto-provision
+```
+
+### How it works
+
+1. **Discovery**: Scans your app for `@Endpoint` decorated functions.
+2. **Deployment**: Deploys resources concurrently (up to 3 at a time).
+3. **Confirmation**: Asks for confirmation if deploying more than 5 endpoints.
+4. **Caching**: Stores deployed resources in `.runpod/resources.pkl` for reuse.
+5. **Updates**: Recognizes existing endpoints and updates if configuration changed.
+
+### Benefits
+
+- **Zero cold start**: All endpoints ready before you test them.
+- **Faster development**: No waiting for deployment on first HTTP call.
+- **Resource reuse**: Cached endpoints are reused across server restarts.
+
+### When to use
+
+- Local development with multiple endpoints.
+- Testing workflows that call multiple remote functions.
+- Debugging where you want deployment separated from handler logic.
+
+## Provisioning modes
+
+| Mode | When endpoints are deployed |
+|------|----------------------------|
+| Default (lazy) | On first `@Endpoint` function call |
+| `--auto-provision` | At server startup |
+
+## Testing your API
+
+Once the server is running, test your endpoints:
+
+```bash
+# Health check
+curl http://localhost:8888/
+
+# Call a queue-based GPU endpoint (gpu_worker.py)
+curl -X POST http://localhost:8888/gpu_worker/runsync \
+ -H "Content-Type: application/json" \
+ -d '{"message": "Hello from GPU!"}'
+
+# Call a load-balanced endpoint (lb_worker.py)
+curl -X POST http://localhost:8888/lb_worker/process \
+ -H "Content-Type: application/json" \
+ -d '{"data": "test"}'
+```
+
+Open http://localhost:8888/docs for the interactive API explorer.
+
+## Requirements
+
+- `RUNPOD_API_KEY` must be set in your `.env` file or environment.
+- A valid Flash project structure (created by `flash init` or manually).
+
+## flash run vs flash deploy
+
+| Aspect | `flash run` | `flash deploy` |
+|--------|-------------|----------------|
+| Local development server | Yes (http://localhost:8888) | No |
+| `@Endpoint` functions run on | Runpod Serverless | Runpod Serverless |
+| Endpoint persistence | Temporary (`live-` prefix) | Persistent |
+| Code updates | Automatic reload | Manual redeploy |
+| Use case | Development | Production |
+
+## Related commands
+
+- [`flash init`](/flash/cli/init) - Create a new project
+- [`flash deploy`](/flash/cli/deploy) - Deploy to production
+- [`flash undeploy`](/flash/cli/undeploy) - Remove endpoints
diff --git a/flash/cli/undeploy.mdx b/flash/cli/undeploy.mdx
new file mode 100644
index 00000000..33506823
--- /dev/null
+++ b/flash/cli/undeploy.mdx
@@ -0,0 +1,213 @@
+---
+title: "undeploy"
+sidebarTitle: "undeploy"
+---
+
+Manage and delete Runpod Serverless endpoints deployed via Flash. Use this command to clean up endpoints created during local development with `flash run`.
+
+```bash
+flash undeploy [NAME|list] [OPTIONS]
+```
+
+## Example
+
+List all tracked endpoints:
+
+```bash
+flash undeploy list
+```
+
+Remove a specific endpoint:
+
+```bash
+flash undeploy ENDPOINT_NAME
+```
+
+Remove all endpoints:
+
+```bash
+flash undeploy --all
+```
+
+## Usage modes
+
+### List endpoints
+
+Display all tracked endpoints with their current status:
+
+```bash
+flash undeploy list
+```
+
+Output includes:
+
+- **Name**: Endpoint name
+- **Endpoint ID**: Runpod endpoint identifier
+- **Status**: Current health status (Active/Inactive/Unknown)
+- **Type**: Resource type (Live Serverless, Cpu Live Serverless, etc.)
+
+**Status indicators:**
+
+| Status | Meaning |
+|--------|---------|
+| Active | Endpoint is running and responding |
+| Inactive | Tracking exists but endpoint deleted externally |
+| Unknown | Error during health check |
+
+### Undeploy by name
+
+Delete a specific endpoint:
+
+```bash
+flash undeploy ENDPOINT_NAME
+```
+
+This:
+
+1. Searches for endpoints matching the name.
+2. Shows endpoint details.
+3. Prompts for confirmation.
+4. Deletes the endpoint from Runpod.
+5. Removes from local tracking.
+
+### Undeploy all
+
+Delete all tracked endpoints (requires double confirmation):
+
+```bash
+flash undeploy --all
+```
+
+Safety features:
+
+1. Shows total count of endpoints.
+2. First confirmation: Yes/No prompt.
+3. Second confirmation: Type "DELETE ALL" exactly.
+4. Deletes all endpoints from Runpod.
+5. Removes all from tracking.
+
+### Interactive selection
+
+Select endpoints to undeploy using checkboxes:
+
+```bash
+flash undeploy --interactive
+```
+
+Use arrow keys to navigate, space bar to select/deselect, and Enter to confirm.
+
+### Clean up stale tracking
+
+Remove inactive endpoints from tracking without API deletion:
+
+```bash
+flash undeploy --cleanup-stale
+```
+
+Use this when endpoints were deleted via the Runpod console or API (not through Flash). The local tracking file (`.runpod/resources.pkl`) becomes stale, and this command cleans it up.
+
+## Flags
+
+
+Undeploy all tracked endpoints. Requires double confirmation for safety.
+
+
+
+Interactive checkbox selection mode. Select multiple endpoints to undeploy.
+
+
+
+Remove inactive endpoints from local tracking without attempting API deletion. Use when endpoints were deleted externally.
+
+
+## Arguments
+
+
+Name of the endpoint to undeploy. Use `list` to show all endpoints.
+
+
+## undeploy vs env delete
+
+| Command | Scope | When to use |
+|---------|-------|-------------|
+| `flash undeploy` | Individual endpoints from local tracking | Development cleanup, granular control |
+| `flash env delete` | Entire environment + all resources | Production cleanup, full teardown |
+
+For production deployments, use `flash env delete` to remove entire environments and all associated resources.
+
+## How tracking works
+
+Flash tracks deployed endpoints in `.runpod/resources.pkl`. Endpoints are added when you:
+
+- Run `flash run --auto-provision`
+- Run `flash run` and call `@Endpoint` functions
+- Run `flash deploy`
+
+The tracking file is in `.gitignore` and should never be committed. It contains local deployment state.
+
+## Common workflows
+
+### Basic cleanup
+
+```bash
+# Check what's deployed
+flash undeploy list
+
+# Remove a specific endpoint
+flash undeploy ENDPOINT_NAME
+
+# Clean up stale tracking
+flash undeploy --cleanup-stale
+```
+
+### Bulk operations
+
+```bash
+# Undeploy all endpoints
+flash undeploy --all
+
+# Interactive selection
+flash undeploy --interactive
+```
+
+### Managing external deletions
+
+If you delete endpoints via the Runpod console:
+
+```bash
+# Check status - will show as "Inactive"
+flash undeploy list
+
+# Remove stale tracking entries
+flash undeploy --cleanup-stale
+```
+
+## Troubleshooting
+
+### Endpoint shows as "Inactive"
+
+The endpoint was deleted via Runpod console or API. Clean up:
+
+```bash
+flash undeploy --cleanup-stale
+```
+
+### Can't find endpoint by name
+
+Check the exact name:
+
+```bash
+flash undeploy list
+```
+
+### Undeploy fails with API error
+
+1. Check `RUNPOD_API_KEY` in `.env`.
+2. Verify network connectivity.
+3. Check if the endpoint still exists on Runpod.
+
+## Related commands
+
+- [`flash run`](/flash/cli/run) - Development server (creates endpoints)
+- [`flash deploy`](/flash/cli/deploy) - Deploy to Runpod
+- [`flash env delete`](/flash/cli/env) - Delete entire environment
diff --git a/flash/configuration/best-practices.mdx b/flash/configuration/best-practices.mdx
new file mode 100644
index 00000000..b4b6ebc1
--- /dev/null
+++ b/flash/configuration/best-practices.mdx
@@ -0,0 +1,170 @@
+---
+title: "Configuration best practices"
+sidebarTitle: "Best practices"
+description: "Recommended configurations for production, development, and cost optimization."
+---
+
+This guide provides best practices for configuring Flash endpoints based on your use case. Recommendations are organized by workload type and optimization goal.
+
+## Production workloads
+
+Here are some best practices for production deployments requiring reliability and consistent performance:
+
+### General recommendations
+
+- **Pin specific GPU types** instead of using `GpuGroup.ANY` for predictable performance and costs.
+- **Use network volumes** for large models to avoid downloading on each worker startup.
+- **Set appropriate `execution_timeout_ms`** to prevent runaway jobs and control costs.
+- **Use environment variables** for configuration and secrets, not hardcoded values.
+
+### Queue-based endpoints
+
+Queue-based endpoints handle asynchronous batch processing where jobs can wait in queue:
+
+```python
+from runpod_flash import Endpoint, GpuType, NetworkVolume
+
+@Endpoint(
+ name="production-batch",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe, # Specific GPU for predictable performance
+ workers=(1, 10), # At least 1 worker, scale up to 10
+ idle_timeout=1200, # 20 minutes - keep workers longer for variable traffic
+ execution_timeout_ms=600000, # 10 minute timeout
+ volume=NetworkVolume(name="my-volume"),
+ env={"MODEL_PATH": "/runpod-volume/models"}
+)
+def process_batch(data): ...
+```
+
+**Key settings**:
+- `workers=(min, max)`: Set the minimum to at least 1 to avoid a cold start for the first job in the queue, and set the maximum based on expected peak concurrent jobs.
+- `idle_timeout`: 900-1800 seconds (15-30 minutes) for production workloads.
+
+### Load-balanced endpoints
+
+Load-balanced endpoints handle synchronous HTTP requests where immediate response is critical:
+
+```python
+from runpod_flash import Endpoint, GpuType, NetworkVolume
+
+api = Endpoint(
+ name="production-api",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, # Specific GPU for consistent performance
+ workers=(3, 20), # Always keep 3 workers ready, scale to 20
+ idle_timeout=1800, # 30 minutes - keep workers active longer
+ execution_timeout_ms=60000, # 60 second timeout per request
+ volume=NetworkVolume(name="my-volume")
+)
+
+@api.post("/process")
+async def process_request(data: dict) -> dict:
+ return {"result": "processed"}
+
+@api.get("/health")
+async def health_check() -> dict:
+ return {"status": "healthy"}
+```
+
+**Key settings**:
+- `workers=(min, max)`: Set the minimum to at least 1 for production APIs to avoid cold starts; unlike queue-based endpoints where jobs can wait, API clients expect immediate responses. Set the maximum based on expected peak concurrent requests.
+- `idle_timeout`: 1200-1800 seconds (20-30 minutes) to keep workers ready.
+- Include health check routes (e.g., `GET /health`) for monitoring.
+
+## Development
+
+Here are some best practices for development and testing environments prioritizing fast iteration:
+
+### General recommendations
+
+- **Use `GpuGroup.ANY`** for fastest GPU provisioning during development.
+- **Set `workers=(0, n)`** to minimize costs when not actively testing.
+- **Keep max workers low** (1-3) to control development expenses.
+- **Use short `idle_timeout`** (300 seconds / 5 minutes) to scale down quickly between test runs.
+- **Test locally** with `flash run` before deploying to production.
+
+### Example configuration
+
+```python
+from runpod_flash import Endpoint, GpuGroup
+
+@Endpoint(
+ name="dev-testing",
+ gpu=GpuGroup.ANY, # Fast provisioning
+ workers=(0, 2), # Scale to zero, limit to 2 concurrent
+ idle_timeout=300 # 5 minutes - quick scale-down
+)
+def test_function(data): ...
+```
+
+## Cost optimization
+
+Here are some best practices for minimizing costs on infrequent or batch workloads:
+
+### General recommendations
+
+- **Set `workers=(0, n)`** to scale to zero when idle (no usage = no cost).
+- **Use smaller GPU types** when workload allows (e.g., `GpuType.NVIDIA_GEFORCE_RTX_4090` instead of `GpuType.NVIDIA_A100_80GB_PCIe`).
+- **Use CPU endpoints** when GPU acceleration isn't needed.
+- **Reduce `idle_timeout`** for sporadic workloads (300-600 seconds / 5-10 minutes).
+- **Batch operations** into fewer job submissions when possible.
+
+### Cost-optimized queue-based endpoint
+
+```python
+from runpod_flash import Endpoint, GpuType, NetworkVolume
+
+@Endpoint(
+ name="batch-job",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, # Cost-effective GPU
+ workers=(0, 5), # Scale to zero, controlled max
+ idle_timeout=300, # 5 minutes - fast scale-down
+ volume=NetworkVolume(name="my-volume") # Avoid re-downloading models
+)
+def batch_process(data): ...
+```
+
+### Cost-optimized CPU endpoint
+
+For workloads that don't require GPU acceleration:
+
+```python
+from runpod_flash import Endpoint
+
+@Endpoint(
+ name="cpu-batch",
+ cpu="cpu5c-4-8", # 4 vCPU, 8GB RAM
+ workers=(0, 3), # Scale to zero, limit to 3
+ idle_timeout=300 # 5 minutes - fast scale-down
+)
+def cpu_process(data): ...
+```
+
+## Configuration trade-offs
+
+Understanding the trade-offs helps you balance cost, latency, and performance:
+
+| Configuration | Cost | Cold Start Latency | Best For |
+|--------------|------|-------------------|----------|
+| `workers=(0, n)` | Lowest | 20-90 seconds first run | Batch jobs, development, infrequent workloads |
+| `workers=(1, n)` | Medium | \<1 second for queued jobs | Production batch, variable traffic |
+| `workers=(3, n)` | Highest | Always ready | Production APIs, high-traffic endpoints |
+
+| GPU Choice | Cost | Availability | Best For |
+|-----------|------|--------------|----------|
+| `GpuGroup.ANY` | Variable | Highest | Development, fastest provisioning |
+| Specific type (e.g., `GpuType.NVIDIA_GEFORCE_RTX_4090`) | Predictable | Medium | Production with specific hardware |
+| Specific type (e.g., `GpuType.NVIDIA_A100_80GB_PCIe`) | Predictable | Lower | Production requiring specific hardware |
+
+## Configuration checklist
+
+Before deploying to production, verify:
+
+- **GPU selection**: Using specific GPU types (not `GpuGroup.ANY`) for predictable performance
+- **Worker scaling**: `workers=(1, n)` or higher min for load balancers and latency-sensitive workloads
+- **Timeouts**: `execution_timeout_ms` set appropriately for your workload
+- **Storage**: Network volume attached if using large models or datasets
+- **Environment variables**: All configuration and secrets passed via `env` parameter
+- **Monitoring**: Health check routes implemented (load balancers)
+- **Testing**: Tested locally with `flash run` before production deployment
\ No newline at end of file
diff --git a/flash/configuration/cpu-types.mdx b/flash/configuration/cpu-types.mdx
new file mode 100644
index 00000000..086df51c
--- /dev/null
+++ b/flash/configuration/cpu-types.mdx
@@ -0,0 +1,129 @@
+---
+title: "CPU types"
+sidebarTitle: "CPU types"
+description: "Available CPU instance types for Flash endpoints."
+---
+
+Flash provides access to CPU-only compute instances for workloads that don't require GPU acceleration. This reference lists all available CPU instance types.
+
+## Using CPU instances
+
+Specify a CPU instance using the `cpu` parameter. You can use either a string shorthand or the `CpuInstanceType` enum:
+
+```python
+from runpod_flash import Endpoint, CpuInstanceType
+
+# String shorthand
+@Endpoint(name="data-processor", cpu="cpu5c-4-8")
+async def process(data: dict) -> dict:
+ ...
+
+# Using enum
+@Endpoint(name="data-processor", cpu=CpuInstanceType.CPU5C_4_8)
+async def process(data: dict) -> dict:
+ ...
+```
+
+## Available CPU instance types
+
+CPU instances are organized by generation and optimization profile.
+
+### 5th generation compute-optimized
+
+Latest generation, optimized for compute-intensive workloads:
+
+| CpuInstanceType | ID | vCPU | RAM | Best For |
+|-----------------|-----|------|-----|----------|
+| `CPU5C_1_2` | cpu5c-1-2 | 1 | 2GB | Lightweight APIs, simple tasks |
+| `CPU5C_2_4` | cpu5c-2-4 | 2 | 4GB | Small APIs, data validation |
+| `CPU5C_4_8` | cpu5c-4-8 | 4 | 8GB | General APIs, data processing |
+| `CPU5C_8_16` | cpu5c-8-16 | 8 | 16GB | Heavy processing, parallel tasks |
+
+### 3rd generation compute-optimized
+
+Balanced compute focus:
+
+| CpuInstanceType | ID | vCPU | RAM | Best For |
+|-----------------|-----|------|-----|----------|
+| `CPU3C_1_2` | cpu3c-1-2 | 1 | 2GB | Basic endpoints, webhooks |
+| `CPU3C_2_4` | cpu3c-2-4 | 2 | 4GB | Simple data processing |
+| `CPU3C_4_8` | cpu3c-4-8 | 4 | 8GB | Moderate workloads |
+| `CPU3C_8_16` | cpu3c-8-16 | 8 | 16GB | CPU-intensive tasks |
+
+### 3rd generation general purpose
+
+Balanced CPU and memory:
+
+| CpuInstanceType | ID | vCPU | RAM | Best For |
+|-----------------|-----|------|-----|----------|
+| `CPU3G_1_4` | cpu3g-1-4 | 1 | 4GB | Memory-light tasks |
+| `CPU3G_2_8` | cpu3g-2-8 | 2 | 8GB | General workloads |
+| `CPU3G_4_16` | cpu3g-4-16 | 4 | 16GB | Memory-intensive processing |
+| `CPU3G_8_32` | cpu3g-8-32 | 8 | 32GB | High-memory workloads |
+
+## Common configurations
+
+### APIs and webhooks
+
+```python
+# Lightweight API
+@Endpoint(name="webhook", cpu="cpu5c-2-4")
+async def handle_webhook(data: dict) -> dict:
+ ...
+
+# Production API
+@Endpoint(name="api", cpu="cpu5c-4-8", workers=(1, 10))
+async def handle_request(data: dict) -> dict:
+ ...
+```
+
+### Data processing
+
+```python
+# Light processing
+@Endpoint(name="processor", cpu="cpu3g-2-8") # More RAM per vCPU
+async def process(data: dict) -> dict:
+ ...
+
+# Heavy processing
+@Endpoint(name="heavy-processor", cpu="cpu5c-8-16")
+async def heavy_process(data: dict) -> dict:
+ ...
+```
+
+### Memory-intensive tasks
+
+```python
+# High memory requirement
+@Endpoint(name="memory-worker", cpu="cpu3g-8-32") # 8 vCPU, 32GB RAM
+async def process_large_data(data: dict) -> dict:
+ ...
+```
+
+### Load-balanced CPU API
+
+```python
+from runpod_flash import Endpoint
+
+api = Endpoint(
+ name="cpu-api",
+ cpu="cpu5c-4-8",
+ workers=(1, 10)
+)
+
+@api.post("/process")
+async def process(data: dict) -> dict:
+ return {"result": "processed"}
+
+@api.get("/health")
+async def health():
+ return {"status": "ok"}
+```
+
+## Container disk sizing
+
+CPU endpoints automatically adjust container disk size based on instance limits:
+- `CPU3G` and `CPU3C` instances: vCPU count × 10GB (e.g., 2 vCPU = 20GB)
+- `CPU5C` instances: vCPU count × 15GB (e.g., 4 vCPU = 60GB)
+
+If you specify a custom size via `PodTemplate` that exceeds the instance limit, deployment will fail with a validation error.
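+
+For example, here's a minimal sketch of an explicit disk size that stays under the limit for a `cpu5c-4-8` instance (4 vCPU × 15GB = 60GB); the endpoint name is illustrative:
+
+```python
+from runpod_flash import Endpoint, PodTemplate
+
+@Endpoint(
+    name="cpu-custom-disk",
+    cpu="cpu5c-4-8",
+    template=PodTemplate(containerDiskInGb=40)  # Under the 60GB limit for this instance
+)
+async def process(data: dict) -> dict:
+    ...
+```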
diff --git a/flash/configuration/gpu-types.mdx b/flash/configuration/gpu-types.mdx
new file mode 100644
index 00000000..b51d4e40
--- /dev/null
+++ b/flash/configuration/gpu-types.mdx
@@ -0,0 +1,193 @@
+---
+title: "GPU types"
+sidebarTitle: "GPU types"
+description: "Available GPU pools and specific GPU types for Flash endpoints."
+---
+
+Flash provides access to a wide range of NVIDIA GPUs through both pool-based and specific GPU selection. This page lists all available GPU types and explains how to use them.
+
+## GPU selection methods
+
+Flash offers two ways to specify GPU hardware:
+
+1. [GPU pools](/flash/configuration/gpu-types#gpu-pools) (`GpuGroup`): Select from predefined pools of similar GPUs grouped by architecture and VRAM.
+2. [Specific GPU types](/flash/configuration/gpu-types#specific-gpu-types) (`GpuType`): Target exact GPU models when you need precise hardware characteristics.
+
+You can use either method or mix both for [advanced fallback strategies](/flash/configuration/gpu-types#advanced-fallback-strategies).
+
+## GPU pools
+
+The `GpuGroup` enum provides access to GPU pools. Each pool contains specific GPU models grouped by architecture and VRAM capacity.
+
+### Available GPU pools
+
+| GpuGroup | GPUs Included | VRAM | Best For |
+|----------|---------------|------|----------|
+| `GpuGroup.ANY` | Any available GPU | Varies | Fast provisioning, prototyping |
+| `GpuGroup.AMPERE_16` | RTX A4000, RTX 4000 Ada, RTX 2000 Ada | 16GB | Small models, basic inference |
+| `GpuGroup.AMPERE_24` | RTX A4500, RTX A5000, RTX 3090 | 20-24GB | General ML, mid-size models |
+| `GpuGroup.ADA_24` | L4, RTX 4090 | 24GB | Cost-effective inference |
+| `GpuGroup.ADA_32_PRO` | RTX 5090 | 32GB | Latest consumer flagship |
+| `GpuGroup.AMPERE_48` | A40, RTX A6000 | 48GB | Large models, fine-tuning |
+| `GpuGroup.ADA_48_PRO` | RTX 6000 Ada | 48GB | Professional inference |
+| `GpuGroup.AMPERE_80` | A100 80GB PCIe, A100-SXM4-80GB | 80GB | XL models, intensive training |
+| `GpuGroup.ADA_80_PRO` | H100 80GB HBM3 | 80GB | Cutting-edge inference |
+| `GpuGroup.HOPPER_141` | H200 | 141GB | Largest models, maximum VRAM |
+
+### Using GPU pools
+
+```python
+from runpod_flash import Endpoint, GpuGroup
+
+# Single GPU pool
+@Endpoint(name="inference", gpu=GpuGroup.AMPERE_80)
+async def infer(data: dict) -> dict:
+ ...
+
+# Multiple pools for fallback
+@Endpoint(
+ name="flexible",
+ gpu=[GpuGroup.AMPERE_80, GpuGroup.AMPERE_48, GpuGroup.ADA_24]
+)
+async def flexible_infer(data: dict) -> dict:
+ ...
+
+# Any available GPU (fastest provisioning)
+@Endpoint(name="development", gpu=GpuGroup.ANY)
+async def dev_infer(data: dict) -> dict:
+ ...
+```
+
+## Specific GPU types
+
+The `GpuType` enum provides access to specific GPU models. Use these when you need exact hardware characteristics.
+
+### Available GPU types
+
+| GpuType | GPU Model | VRAM | Architecture |
+|---------|-----------|------|--------------|
+| `GpuType.NVIDIA_RTX_A4000` | NVIDIA RTX A4000 | 16GB | Ampere |
+| `GpuType.NVIDIA_RTX_A4500` | NVIDIA RTX A4500 | 20GB | Ampere |
+| `GpuType.NVIDIA_RTX_4000_ADA_GENERATION` | NVIDIA RTX 4000 Ada | 16GB | Ada Lovelace |
+| `GpuType.NVIDIA_RTX_2000_ADA_GENERATION` | NVIDIA RTX 2000 Ada | 16GB | Ada Lovelace |
+| `GpuType.NVIDIA_RTX_A5000` | NVIDIA RTX A5000 | 24GB | Ampere |
+| `GpuType.NVIDIA_L4` | NVIDIA L4 | 24GB | Ada Lovelace |
+| `GpuType.NVIDIA_GEFORCE_RTX_3090` | NVIDIA GeForce RTX 3090 | 24GB | Ampere |
+| `GpuType.NVIDIA_GEFORCE_RTX_4090` | NVIDIA GeForce RTX 4090 | 24GB | Ada Lovelace |
+| `GpuType.NVIDIA_GEFORCE_RTX_5090` | NVIDIA GeForce RTX 5090 | 32GB | Blackwell |
+| `GpuType.NVIDIA_A40` | NVIDIA A40 | 48GB | Ampere |
+| `GpuType.NVIDIA_RTX_A6000` | NVIDIA RTX A6000 | 48GB | Ampere |
+| `GpuType.NVIDIA_RTX_6000_ADA_GENERATION` | NVIDIA RTX 6000 Ada | 48GB | Ada Lovelace |
+| `GpuType.NVIDIA_A100_80GB_PCIe` | NVIDIA A100 80GB PCIe | 80GB | Ampere |
+| `GpuType.NVIDIA_A100_SXM4_80GB` | NVIDIA A100-SXM4-80GB | 80GB | Ampere |
+| `GpuType.NVIDIA_H100_80GB_HBM3` | NVIDIA H100 80GB HBM3 | 80GB | Hopper |
+| `GpuType.NVIDIA_H200` | NVIDIA H200 | 141GB | Hopper |
+
+### Using specific GPU types
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+# Single specific GPU
+@Endpoint(name="inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
+async def infer(data: dict) -> dict:
+ ...
+
+# Multiple specific GPUs (fallback strategy)
+@Endpoint(
+ name="flexible",
+ gpu=[
+ GpuType.NVIDIA_A100_80GB_PCIe, # Try A100 PCIe first
+ GpuType.NVIDIA_A100_SXM4_80GB, # Fall back to A100 SXM4
+ GpuType.NVIDIA_A40 # Final fallback to A40
+ ]
+)
+async def flexible_infer(data: dict) -> dict:
+ ...
+```
+
+## Advanced fallback strategies
+
+Combine `GpuGroup` and `GpuType` for robust availability:
+
+```python
+from runpod_flash import Endpoint, GpuGroup, GpuType
+
+@Endpoint(
+ name="hybrid-selection",
+ gpu=[
+ GpuType.NVIDIA_A100_80GB_PCIe, # Specific GPU first
+ GpuGroup.AMPERE_48, # Pool fallback
+ GpuGroup.ANY # Ultimate fallback
+ ]
+)
+async def infer(data: dict) -> dict:
+ ...
+```
+
+## GPU selection behavior
+
+**Single GPU type:**
+Flash waits for this specific GPU to become available. Jobs stay in queue until capacity is available.
+
+```python
+gpu=GpuGroup.AMPERE_80 # Only A100 80GB
+```
+
+**Multiple GPU types (fallback):**
+Flash attempts to provision in the order specified.
+
+```python
+gpu=[GpuGroup.AMPERE_80, GpuGroup.AMPERE_48, GpuGroup.ADA_24]
+# Tries: A100 → A40/RTX A6000 → L4/RTX 4090
+```
+
+**GpuGroup.ANY:**
+Flash selects the first available GPU based on current capacity.
+
+```python
+gpu=GpuGroup.ANY # Fastest provisioning, unpredictable GPU type
+```
+
+
+**For production**: Use specific GPU types for predictable cost and performance.
+**For development**: Use `GpuGroup.ANY` for fastest iteration.
+
+
+## Multi-GPU workers
+
+Request multiple GPUs per worker using `gpu_count`:
+
+```python
+@Endpoint(
+ name="multi-gpu-training",
+ gpu=GpuGroup.AMPERE_80,
+ gpu_count=4, # Each worker gets 4 GPUs
+ workers=2 # Maximum 2 workers = 8 GPUs total
+)
+async def train(data: dict) -> dict:
+ ...
+```
+
+## Handling unavailability
+
+If requested GPUs are unavailable, jobs stay in queue:
+
+```text
+Initial job status: IN_QUEUE
+[Waiting for capacity...]
+```
+
+**Solutions:**
+
+1. **Add fallback options**: Use multiple GPU types.
+ ```python
+ gpu=[GpuGroup.AMPERE_80, GpuGroup.AMPERE_48, GpuGroup.ADA_24]
+ ```
+
+2. **Use broader selection**: Switch to `GpuGroup.ANY`.
+ ```python
+ gpu=GpuGroup.ANY
+ ```
+
+3. **Contact support**: For capacity guarantees, contact [Runpod support](https://www.runpod.io/contact).
diff --git a/flash/configuration/parameters.mdx b/flash/configuration/parameters.mdx
new file mode 100644
index 00000000..ef3b78e1
--- /dev/null
+++ b/flash/configuration/parameters.mdx
@@ -0,0 +1,533 @@
+---
+title: "Endpoint parameters"
+sidebarTitle: "Parameters"
+description: "Complete reference for all Endpoint class parameters."
+---
+
+This page provides a complete reference for all parameters available on the `Endpoint` class.
+
+## Parameter overview
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `name` | `str` | Endpoint name (required unless `id=` is used) | - |
+| `id` | `str` | Connect to existing endpoint by ID | `None` |
+| `gpu` | `GpuGroup`, `GpuType`, or list | GPU type(s) for the endpoint | `GpuGroup.ANY` |
+| `cpu` | `str` or `CpuInstanceType` | CPU instance type (mutually exclusive with `gpu`) | `None` |
+| `workers` | `int` or `(min, max)` | Worker scaling configuration | `(0, 1)` |
+| `idle_timeout` | `int` | Seconds before scaling down idle workers | `60` |
+| `dependencies` | `list[str]` | Python packages to install | `None` |
+| `system_dependencies` | `list[str]` | System packages to install (apt) | `None` |
+| `accelerate_downloads` | `bool` | Enable download acceleration | `True` |
+| `volume` | `NetworkVolume` | Network volume for persistent storage | `None` |
+| `datacenter` | `DataCenter` | Preferred datacenter | `EU_RO_1` |
+| `env` | `dict[str, str]` | Environment variables | `None` |
+| `gpu_count` | `int` | GPUs per worker | `1` |
+| `execution_timeout_ms` | `int` | Max execution time in milliseconds | `0` (no limit) |
+| `flashboot` | `bool` | Enable Flashboot fast startup | `True` |
+| `image` | `str` | Custom Docker image to deploy | `None` |
+| `scaler_type` | `ServerlessScalerType` | Scaling strategy | Auto-selected by endpoint type |
+| `scaler_value` | `int` | Scaling threshold | `4` |
+| `template` | `PodTemplate` | Pod template overrides | `None` |
+
+## Parameter details
+
+### name
+
+**Type**: `str`
+**Required**: Yes (unless `id=` is specified)
+
+The endpoint name visible in the [Runpod console](https://www.runpod.io/console/serverless). Use descriptive names to easily identify endpoints.
+
+```python
+@Endpoint(name="ml-inference-prod", gpu=GpuGroup.ANY)
+async def infer(data): ...
+```
+
+
+Use naming conventions like `image-generation-prod` or `batch-processor-dev` to organize your endpoints.
+
+
+### id
+
+**Type**: `str`
+**Default**: `None`
+
+Connect to an existing deployed endpoint by its ID. When `id` is specified, `name` is not required.
+
+```python
+# Connect to existing endpoint
+ep = Endpoint(id="abc123xyz")
+
+# Make requests
+job = await ep.run({"prompt": "hello"})
+result = await ep.post("/inference", {"data": "..."})
+```
+
+### gpu
+
+**Type**: `GpuGroup`, `GpuType`, or `list[GpuGroup | GpuType]`
+**Default**: `GpuGroup.ANY` (if neither `gpu` nor `cpu` is specified)
+
+Specifies GPU hardware for the endpoint. Accepts a single GPU type/group or a list for fallback strategies.
+
+```python
+from runpod_flash import Endpoint, GpuType, GpuGroup
+
+# Specific GPU type
+@Endpoint(name="inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
+async def infer(data): ...
+
+# Another specific GPU type
+@Endpoint(name="rtx-worker", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
+async def process(data): ...
+
+# Multiple types for fallback
+@Endpoint(name="flexible", gpu=[GpuType.NVIDIA_A100_80GB_PCIe, GpuType.NVIDIA_RTX_A6000, GpuType.NVIDIA_GEFORCE_RTX_4090])
+async def flexible_infer(data): ...
+```
+
+See [GPU types](/flash/configuration/gpu-types) for all available options.
+
+### cpu
+
+**Type**: `str` or `CpuInstanceType`
+**Default**: `None`
+
+Specifies a CPU instance type. Mutually exclusive with `gpu`.
+
+```python
+from runpod_flash import Endpoint, CpuInstanceType
+
+# String shorthand
+@Endpoint(name="data-processor", cpu="cpu5c-4-8")
+async def process(data): ...
+
+# Using enum
+@Endpoint(name="data-processor", cpu=CpuInstanceType.CPU5C_4_8)
+async def process(data): ...
+```
+
+See [CPU types](/flash/configuration/cpu-types) for all available options.
+
+### workers
+
+**Type**: `int` or `tuple[int, int]`
+**Default**: `(0, 1)`
+
+Controls worker scaling. Accepts either a single integer (max workers with min=0) or a tuple of (min, max).
+
+```python
+# Just max: scales from 0 to 5
+@Endpoint(name="elastic", gpu=GpuGroup.ANY, workers=5)
+
+# Min and max: always keep 2 warm, scale up to 10
+@Endpoint(name="always-on", gpu=GpuGroup.ANY, workers=(2, 10))
+
+# Default: (0, 1)
+@Endpoint(name="default", gpu=GpuGroup.ANY)
+```
+
+**Recommendations**:
+- `workers=N` or `workers=(0, N)`: Cost-optimized, allows scale to zero
+- `workers=(1, N)`: Avoid cold starts by keeping at least one worker warm
+- `workers=(N, N)`: Fixed worker count for consistent performance
+
+### idle_timeout
+
+**Type**: `int`
+**Default**: `60`
+
+The number of seconds a worker stays active with no traffic before scaling down toward the minimum worker count.
+
+```python
+# Quick scale-down for cost savings
+@Endpoint(name="batch", gpu=GpuGroup.ANY, idle_timeout=30)
+
+# Keep workers longer for variable traffic
+@Endpoint(name="api", gpu=GpuGroup.ANY, idle_timeout=120)
+```
+
+**Recommendations**:
+- `30-60 seconds`: Cost-optimized, infrequent traffic
+- `60-120 seconds`: Balanced, variable traffic patterns
+- `120-300 seconds`: Latency-optimized, consistent traffic
+
+### dependencies
+
+**Type**: `list[str]`
+**Default**: `None`
+
+Python packages to install on the remote worker before executing your function. Supports standard pip syntax.
+
+```python
+@Endpoint(
+ name="ml-worker",
+ gpu=GpuGroup.ANY,
+ dependencies=["torch>=2.0.0", "transformers==4.36.0", "pillow"]
+)
+async def process(data): ...
+```
+
+
+Packages must be imported **inside** the function body, not at the top of your file.
+
+
+### system_dependencies
+
+**Type**: `list[str]`
+**Default**: `None`
+
+System-level packages to install via apt before your function runs.
+
+```python
+@Endpoint(
+ name="video-processor",
+ gpu=GpuGroup.ANY,
+ dependencies=["opencv-python"],
+ system_dependencies=["libgl1-mesa-glx", "libglib2.0-0"]
+)
+async def process_video(data): ...
+```
+
+### accelerate_downloads
+
+**Type**: `bool`
+**Default**: `True`
+
+Enables faster downloads for dependencies, models, and large files. Disable if you encounter compatibility issues.
+
+```python
+@Endpoint(
+ name="standard-downloads",
+ gpu=GpuGroup.ANY,
+ accelerate_downloads=False
+)
+async def process(data): ...
+```
+
+### volume
+
+**Type**: `NetworkVolume`
+**Default**: `None`
+
+Attaches a network volume for persistent storage. Volumes are mounted at `/runpod-volume/`. Flash uses the volume `name` to find an existing volume or create a new one.
+
+```python
+from runpod_flash import Endpoint, GpuGroup, NetworkVolume
+
+vol = NetworkVolume(name="model-cache") # Finds existing or creates new
+
+@Endpoint(
+ name="model-server",
+ gpu=GpuGroup.ANY,
+ volume=vol
+)
+async def serve(data):
+ # Access files at /runpod-volume/
+ model = load_model("/runpod-volume/models/bert")
+ ...
+```
+
+**Use cases**:
+- Share large models across workers
+- Persist data between runs
+- Share datasets across endpoints
+
+See [Storage](/flash/configuration/storage) for setup instructions.
+
+### datacenter
+
+**Type**: `DataCenter`
+**Default**: `DataCenter.EU_RO_1`
+
+Preferred datacenter for worker deployment.
+
+```python
+from runpod_flash import Endpoint, DataCenter
+
+@Endpoint(
+ name="eu-workers",
+ gpu=GpuGroup.ANY,
+ datacenter=DataCenter.EU_RO_1
+)
+async def process(data): ...
+```
+
+
+Flash Serverless deployments are currently restricted to `EU-RO-1`.
+
+
+### env
+
+**Type**: `dict[str, str]`
+**Default**: `None`
+
+Environment variables passed to all workers. Useful for API keys, configuration, and feature flags.
+
+```python
+@Endpoint(
+ name="ml-worker",
+ gpu=GpuGroup.ANY,
+ env={
+ "HF_TOKEN": "your_huggingface_token",
+ "MODEL_ID": "gpt2",
+ "LOG_LEVEL": "INFO"
+ }
+)
+async def load_model():
+ import os
+ token = os.getenv("HF_TOKEN")
+ model_id = os.getenv("MODEL_ID")
+ ...
+```
+
+
+Environment variables are excluded from configuration hashing. Changing environment values won't trigger endpoint recreation, making it easy to rotate API keys.
+
+
+### gpu_count
+
+**Type**: `int`
+**Default**: `1`
+
+Number of GPUs per worker. Use for multi-GPU workloads.
+
+```python
+@Endpoint(
+ name="multi-gpu-training",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ gpu_count=4, # Each worker gets 4 GPUs
+ workers=2 # Maximum 2 workers = 8 GPUs total
+)
+async def train(data): ...
+```
+
+### execution_timeout_ms
+
+**Type**: `int`
+**Default**: `0` (no limit)
+
+Maximum execution time for a single job in milliseconds. Jobs exceeding this timeout are terminated.
+
+```python
+# 5 minute timeout
+@Endpoint(
+ name="training",
+ gpu=GpuGroup.ANY,
+ execution_timeout_ms=300000 # 5 * 60 * 1000
+)
+async def train(data): ...
+
+# 30 second timeout for quick inference
+@Endpoint(
+ name="quick-inference",
+ gpu=GpuGroup.ANY,
+ execution_timeout_ms=30000
+)
+async def infer(data): ...
+```
+
+### flashboot
+
+**Type**: `bool`
+**Default**: `True`
+
+Enables Flashboot for faster cold starts by pre-loading container images.
+
+```python
+@Endpoint(
+ name="fast-startup",
+ gpu=GpuGroup.ANY,
+ flashboot=True # Default
+)
+async def process(data): ...
+```
+
+Set to `False` for debugging or compatibility reasons.
+
+### image
+
+**Type**: `str`
+**Default**: `None`
+
+Custom Docker image to deploy. When specified, the endpoint runs your Docker image instead of Flash's managed workers.
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+vllm = Endpoint(
+ name="vllm-server",
+ image="runpod/worker-vllm:stable-cuda12.1.0",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ env={"MODEL_NAME": "meta-llama/Llama-3.2-3B-Instruct"}
+)
+
+# Make HTTP calls to the deployed image
+result = await vllm.post("/v1/completions", {"prompt": "Hello"})
+```
+
+See [Custom Docker images](/flash/custom-docker-images) for complete documentation.
+
+### scaler_type
+
+**Type**: `ServerlessScalerType`
+**Default**: Auto-selected based on endpoint type
+
+Scaling algorithm strategy. Defaults are automatically set:
+- Queue-based: `QUEUE_DELAY` (scales based on queue depth)
+- Load-balanced: `REQUEST_COUNT` (scales based on active requests)
+
+```python
+from runpod_flash import Endpoint, ServerlessScalerType
+
+@Endpoint(
+ name="custom-scaler",
+ gpu=GpuGroup.ANY,
+ scaler_type=ServerlessScalerType.QUEUE_DELAY
+)
+async def process(data): ...
+```
+
+### scaler_value
+
+**Type**: `int`
+**Default**: `4`
+
+Parameter value for the scaling algorithm. With `QUEUE_DELAY`, represents target jobs per worker before scaling up.
+
+```python
+# Scale up when > 2 jobs per worker (more aggressive)
+@Endpoint(
+ name="responsive",
+ gpu=GpuGroup.ANY,
+ scaler_value=2
+)
+async def process(data): ...
+```
+
+### template
+
+**Type**: `PodTemplate`
+**Default**: `None`
+
+Advanced pod configuration overrides.
+
+```python
+from runpod_flash import Endpoint, GpuGroup, PodTemplate
+
+@Endpoint(
+ name="custom-pod",
+ gpu=GpuGroup.ANY,
+ template=PodTemplate(
+ containerDiskInGb=100,
+ env=[{"key": "PYTHONPATH", "value": "/workspace"}]
+ )
+)
+async def process(data): ...
+```
+
+## PodTemplate
+
+`PodTemplate` provides advanced pod configuration options:
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `containerDiskInGb` | `int` | Container disk size in GB | 64 |
+| `env` | `list[dict]` | Environment variables as list of `{"key": "...", "value": "..."}` | `None` |
+
+```python
+from runpod_flash import PodTemplate
+
+template = PodTemplate(
+ containerDiskInGb=100,
+ env=[
+ {"key": "PYTHONPATH", "value": "/workspace"},
+ {"key": "CUDA_VISIBLE_DEVICES", "value": "0"}
+ ]
+)
+```
+
+
+For simple environment variables, use the `env` parameter on `Endpoint` instead of `PodTemplate.env`.
+
+
+## EndpointJob
+
+When using `Endpoint(id=...)` or `Endpoint(image=...)`, the `.run()` method returns an `EndpointJob` object for async operations:
+
+```python
+ep = Endpoint(id="abc123")
+
+# Submit a job
+job = await ep.run({"prompt": "hello"})
+
+# Check status
+status = await job.status() # "IN_PROGRESS", "COMPLETED", etc.
+
+# Wait for completion
+await job.wait(timeout=60) # Optional timeout in seconds
+
+# Access results
+print(job.id) # Job ID
+print(job.output) # Result payload
+print(job.error) # Error message if failed
+print(job.done) # True if completed/failed
+
+# Cancel a job
+await job.cancel()
+```
+
+## Configuration change behavior
+
+When you change configuration and redeploy, Flash automatically updates your endpoint.
+
+### Changes that recreate workers
+
+These changes restart all workers:
+- GPU configuration (`gpu`, `gpu_count`)
+- CPU instance type (`cpu`)
+- Docker image (`image`)
+- Storage (`volume`)
+- Datacenter (`datacenter`)
+- Flashboot setting (`flashboot`)
+
+Workers are temporarily unavailable during recreation (typically 30-90 seconds).
+
+### Changes that update settings only
+
+These changes apply immediately with no downtime:
+- Worker scaling (`workers`)
+- Timeouts (`idle_timeout`, `execution_timeout_ms`)
+- Scaler settings (`scaler_type`, `scaler_value`)
+- Environment variables (`env`)
+- Endpoint name (`name`)
+
+```python
+# First deployment
+@Endpoint(
+ name="inference-api",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ workers=5,
+ env={"MODEL": "v1"}
+)
+async def infer(data): ...
+
+# Update scaling - no worker recreation
+@Endpoint(
+ name="inference-api",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe, # Same GPU
+ workers=10, # Changed - updates settings only
+ env={"MODEL": "v2"} # Changed - updates settings only
+)
+async def infer(data): ...
+
+# Change GPU type - workers recreated
+@Endpoint(
+ name="inference-api",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, # Changed - triggers recreation
+ workers=10,
+ env={"MODEL": "v2"}
+)
+async def infer(data): ...
+```
diff --git a/flash/configuration/storage.mdx b/flash/configuration/storage.mdx
new file mode 100644
index 00000000..cee44e2a
--- /dev/null
+++ b/flash/configuration/storage.mdx
@@ -0,0 +1,130 @@
+---
+title: "Storage"
+sidebarTitle: "Storage"
+description: "Understand container disk and network volume storage for Flash workloads."
+---
+
+import { NetworkVolumesTooltip, WorkerContainerDiskTooltip } from "/snippets/tooltips.jsx"
+
+Flash workers have access to two types of storage: a container disk for temporary data and network volumes for persistent, shareable data.
+
+## Container disk
+
+A container disk provides temporary storage that exists only while a worker is running. Each worker gets its own isolated container disk, with a default size of 64GB for GPU endpoints.
+
+You can read and write temporary files to the container disk using standard filesystem operations from within `@Endpoint` functions.
+
+Any file that is *not* written to a network volume (at `/runpod-volume/`) is written to the container disk, and will be erased when the worker stops.
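+
+For example, here's a minimal sketch that writes and reads a scratch file on the container disk (the endpoint name and file path are illustrative):
+
+```python
+from runpod_flash import Endpoint, GpuGroup
+
+@Endpoint(name="scratch-demo", gpu=GpuGroup.ANY)
+async def write_scratch(data: dict) -> dict:
+    # Anything written outside /runpod-volume/ lands on the container disk
+    # and disappears when this worker stops.
+    with open("/tmp/intermediate.txt", "w") as f:
+        f.write(str(data))
+    with open("/tmp/intermediate.txt") as f:
+        return {"scratch": f.read()}
+```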
+
+### Configuring container disk size (GPU-only)
+
+Configure container disk size for GPU endpoints using the `template` parameter (default: 64GB).
+
+```python
+from runpod_flash import Endpoint, GpuType, PodTemplate
+
+@Endpoint(
+ name="large-temp-storage",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ template=PodTemplate(containerDiskInGb=100)
+)
+async def process(data: dict) -> dict:
+ # 100GB container disk available
+ ...
+```
+
+### CPU auto-sizing
+
+CPU endpoints automatically adjust container disk size based on instance limits:
+- `CPU3G` and `CPU3C` instances: vCPU count × 10GB (e.g., 2 vCPU = 20GB)
+- `CPU5C` instances: vCPU count × 15GB (e.g., 4 vCPU = 60GB)
+
+If you specify a custom size that exceeds the instance limit, deployment will fail with a validation error.
+
+## Network volumes
+
+Network volumes provide persistent storage that survives worker restarts. Use this to share data between endpoint functions with the same network volume attached, or to persist data between runs.
+
+### Attaching network volumes
+
+Attach a network volume using the `volume` parameter. Flash uses the volume `name` to find an existing volume or create a new one:
+
+```python
+from runpod_flash import Endpoint, GpuType, NetworkVolume
+
+vol = NetworkVolume(name="model-cache") # Finds existing or creates new
+
+@Endpoint(
+ name="persistent-storage",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ volume=vol
+)
+async def process(data: dict) -> dict:
+ # Access files at /runpod-volume/
+ ...
+```
+
+### Accessing network volume files
+
+Network volumes mount at `/runpod-volume/` and can be accessed like a regular filesystem:
+
+```python
+from runpod_flash import Endpoint, GpuType, NetworkVolume
+
+vol = NetworkVolume(name="model-storage")
+
+@Endpoint(
+ name="model-server",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ volume=vol,
+ dependencies=["torch", "transformers"]
+)
+async def run_inference(prompt: str) -> dict:
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model from network volume
+ # Persists across worker restarts and shared between workers
+ model_path = "/runpod-volume/models/llama-7b"
+ model = AutoModelForCausalLM.from_pretrained(model_path)
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # Run inference
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+ outputs = model.generate(**inputs, max_length=100)
+ text = tokenizer.decode(outputs[0])
+
+ return {"generated_text": text}
+```
+
+### Load-balanced endpoints with storage
+
+```python
+from runpod_flash import Endpoint, GpuType, NetworkVolume
+
+vol = NetworkVolume(name="model-storage")
+
+api = Endpoint(
+ name="inference-api",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ volume=vol,
+ workers=(1, 5)
+)
+
+@api.post("/generate")
+async def generate(prompt: str) -> dict:
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained("/runpod-volume/models/gpt2")
+ # Generate text
+ return {"text": "generated"}
+
+@api.get("/models")
+async def list_models() -> dict:
+ import os
+ models = os.listdir("/runpod-volume/models")
+ return {"models": models}
+```
+
+### Creating and managing network volumes
+
+Flash creates a network volume automatically when no volume with the given `name` exists, but you can also create and manage volumes yourself. See [Network volumes](/storage/network-volumes) for detailed instructions.
diff --git a/flash/create-endpoints.mdx b/flash/create-endpoints.mdx
new file mode 100644
index 00000000..c58dfcb3
--- /dev/null
+++ b/flash/create-endpoints.mdx
@@ -0,0 +1,360 @@
+---
+title: "Create endpoints"
+sidebarTitle: "Create endpoints"
+description: "Learn how to create and configure hardware and scaling behavior with the Flash Endpoint class."
+---
+
+import { WorkerTooltip, ServerlessTooltip, NetworkVolumesTooltip } from "/snippets/tooltips.jsx";
+
+In Flash, endpoints are the bridge between your local Python functions and Runpod's cloud infrastructure. When you decorate a function with `@Endpoint`, you're marking it to run remotely on Runpod instead of your local machine:
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(
+ name="my-inference",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ dependencies=["torch"]
+)
+def run_model(data):
+ import torch
+ # This code runs on a Runpod GPU, not locally
+ return {"result": "processed"}
+```
+
+When you call `run_model(data)`, Flash provisions a GPU on Runpod (or reuses an existing one), sends your function code and input to the worker, executes it, and returns the result to your local environment.
+
+Each unique endpoint `name` creates one Serverless endpoint on Runpod with its own URL, scaling configuration, and hardware allocation. The endpoint manages workers that scale up and down based on demand.
+
+## Endpoint types
+
+The `Endpoint` class supports four distinct patterns.
+
+### Queue-based endpoints
+
+Use `@Endpoint(...)` as a decorator for batch processing and async workloads. Each function gets its own endpoint with dedicated workers.
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(
+ name="image-processor",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ workers=(0, 5),
+ dependencies=["torch", "pillow"]
+)
+async def process_image(image_data: dict) -> dict:
+ import torch
+ from PIL import Image
+ # Process image on GPU
+ return {"processed": True}
+```
+
+Queue-based endpoints are ideal for:
+- Batch processing jobs
+- Long-running computations
+- Workloads that don't need immediate responses
+
+### Load-balanced endpoints
+
+Instantiate `Endpoint(...)` and attach route decorators to build HTTP APIs. Multiple routes share the same workers.
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+api = Endpoint(
+ name="inference-api",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ workers=(1, 5)
+)
+
+@api.post("/predict")
+async def predict(data: dict) -> dict:
+ import torch
+ # Run inference
+ return {"prediction": "result"}
+
+@api.get("/health")
+async def health():
+ return {"status": "ok"}
+```
+
+Load-balanced endpoints are ideal for:
+- REST APIs with multiple routes
+- Low-latency request/response patterns
+- Services requiring custom HTTP methods
+
+### Custom Docker images
+
+Deploy pre-built Docker images (like vLLM or your own workers) and interact with them as a client:
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+vllm = Endpoint(
+ name="vllm-server",
+ image="vllm/vllm-openai:latest",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe
+)
+
+# Make HTTP calls to the deployed image
+result = await vllm.post("/v1/completions", {"prompt": "Hello"})
+models = await vllm.get("/v1/models")
+```
+
+See [Custom Docker images](/flash/custom-docker-images) for complete documentation, including available images and configuration options.
+
+### Existing endpoints
+
+Connect to an already-deployed Runpod endpoint by ID:
+
+```python
+from runpod_flash import Endpoint
+
+ep = Endpoint(id="abc123")
+
+# Queue-based calls
+job = await ep.run({"prompt": "hello"})
+await job.wait()
+print(job.output)
+
+# Or load-balanced calls
+result = await ep.post("/v1/completions", {"prompt": "hello"})
+```
+
+## GPU vs CPU
+
+Specify `gpu=` for GPU endpoints or `cpu=` for CPU endpoints. They are mutually exclusive.
+
+### GPU endpoints
+
+```python
+from runpod_flash import Endpoint, GpuType, GpuGroup
+
+# Use a specific GPU type
+@Endpoint(name="ml-inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
+async def infer(data: dict) -> dict: ...
+
+# Use another specific GPU type
+@Endpoint(name="rtx-worker", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
+async def render(data: dict) -> dict: ...
+
+# Use multiple GPU types for better availability
+@Endpoint(name="flexible", gpu=[GpuType.NVIDIA_GEFORCE_RTX_4090, GpuType.NVIDIA_RTX_A5000])
+async def process(data: dict) -> dict: ...
+```
+
+If neither `gpu=` nor `cpu=` is specified, GPU defaults to `GpuGroup.ANY`.
+
+### CPU endpoints
+
+```python
+from runpod_flash import Endpoint, CpuInstanceType
+
+# Use string shorthand
+@Endpoint(name="data-processor", cpu="cpu5c-4-8")
+async def process(data: dict) -> dict: ...
+
+# Or use the enum
+@Endpoint(name="data-processor", cpu=CpuInstanceType.CPU5C_4_8)
+async def process(data: dict) -> dict: ...
+```
+
+See [GPU types](/flash/configuration/gpu-types) and [CPU types](/flash/configuration/cpu-types) for available options.
+
+## Worker scaling
+
+Control how many workers run for your endpoint with the `workers` parameter:
+
+```python
+# Just a max: scales from 0 to 5
+@Endpoint(name="elastic", gpu=GpuGroup.ANY, workers=5)
+
+# Min and max tuple: always keep 2 warm, scale up to 10
+@Endpoint(name="always-on", gpu=GpuGroup.ANY, workers=(2, 10))
+
+# Default is (0, 1) if not specified
+@Endpoint(name="default", gpu=GpuGroup.ANY)
+```
+
+Setting `workers=(1, N)` keeps at least one worker warm, avoiding cold starts.
+
+## Dependency management
+
+Specify Python packages in the `dependencies` parameter. Flash installs these on the remote worker before executing your function.
+
+```python
+@Endpoint(
+ name="text-gen",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ dependencies=["transformers==4.36.0", "torch", "pillow"]
+)
+def generate_text(prompt):
+ from transformers import pipeline
+ import torch
+ # Your code here
+```
+
+### Version pinning
+
+Use standard pip syntax for version constraints:
+
+```python
+dependencies=["transformers==4.36.0", "torch>=2.0.0", "numpy<2.0"]
+```
+
+### Import packages inside the function body
+
+You must import packages **inside the decorated function body**, not at the top of your file. This ensures imports happen on the remote worker.
+
+**Correct:** imports inside the function.
+```python
+@Endpoint(name="compute", gpu=GpuGroup.ANY, dependencies=["numpy"])
+def compute(data):
+ import numpy as np # Import here
+ return np.sum(data)
+```
+
+**Incorrect:** imports at top of file won't work.
+```python
+import numpy as np # This import happens locally, not on the worker
+
+@Endpoint(name="compute", gpu=GpuGroup.ANY, dependencies=["numpy"])
+def compute(data):
+ return np.sum(data) # numpy not available on the remote worker
+```
+
+### System dependencies
+
+Use `system_dependencies` to install system-level packages (via apt):
+
+```python
+@Endpoint(
+ name="video-processor",
+ gpu=GpuGroup.ANY,
+ dependencies=["opencv-python"],
+ system_dependencies=["libgl1-mesa-glx", "libglib2.0-0"]
+)
+async def process_video(video_data):
+ import cv2
+ # OpenCV processing
+ return {"processed": True}
+```
+
+## Parallel execution
+
+Calls to `@Endpoint` functions are awaitable, so you can use Python's `asyncio` to run multiple operations concurrently:
+
+```python
+import asyncio
+
+async def main():
+ # Run three functions in parallel
+ results = await asyncio.gather(
+ process_item(item1),
+ process_item(item2),
+ process_item(item3)
+ )
+ return results
+```
+
+This is useful for:
+- Batch processing multiple inputs
+- Running different models on the same data
+- Parallelizing independent pipeline stages
+
+## Environment variables
+
+Pass environment variables using the `env` parameter:
+
+```python
+@Endpoint(
+ name="api-worker",
+ gpu=GpuGroup.ANY,
+ env={
+ "HF_TOKEN": "your_huggingface_token",
+ "MODEL_ID": "gpt2"
+ }
+)
+async def load_model():
+ import os
+ from transformers import AutoModel
+
+ hf_token = os.getenv("HF_TOKEN")
+ model_id = os.getenv("MODEL_ID")
+
+ model = AutoModel.from_pretrained(model_id, token=hf_token)
+ return {"model_loaded": model_id}
+```
+
+
+Environment variables are excluded from configuration hashing. Changing environment values won't trigger endpoint recreation, making it easy to rotate API keys.
+
+
+## Persistent storage
+
+Attach a network volume for persistent storage across workers. Flash uses the volume `name` to find an existing volume or create a new one:
+
+```python
+from runpod_flash import Endpoint, GpuGroup, NetworkVolume
+
+vol = NetworkVolume(name="model-cache") # Finds existing or creates new
+
+@Endpoint(
+ name="model-server",
+ gpu=GpuGroup.ANY,
+ volume=vol
+)
+async def serve(data: dict) -> dict:
+ # Access files at /runpod-volume/
+ ...
+```
+
+See [Flash storage](/flash/configuration/storage) for setup instructions.
+
+## Endpoint parameters
+
+For a complete list of parameters available for the `Endpoint` class, see [Endpoint parameters](/flash/configuration/parameters).
+
+## Working with jobs (client mode)
+
+When using `Endpoint(id=...)` or `Endpoint(image=...)`, you get an `EndpointJob` object for async operations:
+
+```python
+ep = Endpoint(id="abc123")
+
+# Submit a job
+job = await ep.run({"prompt": "hello"})
+
+# Check status
+status = await job.status() # "IN_PROGRESS", "COMPLETED", etc.
+
+# Wait for completion
+await job.wait(timeout=60) # Optional timeout in seconds
+
+# Access results
+print(job.id) # Job ID
+print(job.output) # Result payload
+print(job.error) # Error message if failed
+print(job.done) # True if completed/failed
+
+# Cancel a job
+await job.cancel()
+```
+
+## Next steps
+
+
+
+ Deploy pre-built Docker images with Flash.
+
+
+ Create production APIs with Flash apps.
+
+
+ Deploy Flash applications for production.
+
+
+ Remove development endpoints when done testing.
+
+
diff --git a/flash/custom-docker-images.mdx b/flash/custom-docker-images.mdx
new file mode 100644
index 00000000..57a6f133
--- /dev/null
+++ b/flash/custom-docker-images.mdx
@@ -0,0 +1,291 @@
+---
+title: "Use custom containers with Flash"
+sidebarTitle: "Custom containers"
+description: "Deploy pre-built Docker images with Flash using Endpoint."
+---
+
+The `@Endpoint` decorator handles most use cases, allowing you to execute arbitrary Python code remotely without managing Docker images.
+
+However, for specialized environments that require custom Docker images, you can use `Endpoint(image=...)` to deploy your own Docker images.
+
+## When to use custom Docker images
+
+Use custom Docker images when you need:
+
+- **Pre-built inference servers**: vLLM, TensorRT-LLM, or other specialized serving frameworks.
+- **System-level dependencies**: Custom CUDA versions, cuDNN, or system libraries not installable via `pip`.
+- **Baked-in models**: Large models pre-downloaded in the image to avoid runtime downloads.
+- **Existing Serverless workers**: You already have a working Runpod Serverless Docker image.
+
+
+For most use cases, use `@Endpoint` with the `dependencies` parameter. It's simpler, faster, and lets you execute arbitrary Python code remotely.
+
+
+## Available Docker images
+
+### Official Runpod workers
+
+Runpod provides pre-built worker images for common frameworks:
+
+| Framework | Image name | Documentation |
+|-----------|-------|---------------|
+| vLLM | `runpod/worker-vllm` | [vLLM docs](/serverless/vllm/overview) |
+| Automatic1111 | `runpod/worker-a1111:stable` | [A1111 docs](/serverless/workers/sdxl-a1111) |
+| ComfyUI | `runpod/worker-comfy` | [Docker Hub](https://hub.docker.com/r/runpod/worker-comfyui) |
+
+### Custom images
+
+To create a custom Docker image:
+
+1. [Build a handler function](/serverless/workers/handler-functions) to process requests.
+2. [Create a Dockerfile](/serverless/workers/create-dockerfile) to build the image.
+3. [Push the image to a registry](/serverless/workers/deploy).
+4. Reference the image with `Endpoint(image=...)`.
+
+## Deploy a custom image
+
+
+
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+vllm = Endpoint(
+ name="my-vllm-server",
+ image="runpod/worker-vllm:stable-cuda12.1.0",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ workers=3,
+ env={
+ "MODEL_NAME": "microsoft/Phi-3.5-mini-instruct",
+ "MAX_MODEL_LEN": "4096"
+ }
+)
+```
+
+
+
+
+Use HTTP methods to call your deployed image:
+
+```python
+import asyncio
+
+async def main():
+ # POST request
+ result = await vllm.post("/v1/completions", {
+ "prompt": "Explain quantum computing:",
+ "max_tokens": 100
+ })
+ print(result)
+
+ # GET request
+ models = await vllm.get("/v1/models")
+ print(models)
+
+asyncio.run(main())
+```
+
+Or use queue-based calls:
+
+```python
+import asyncio
+
+async def main():
+ # Submit job to queue
+ job = await vllm.run({
+ "input": {
+ "prompt": "Explain quantum computing:",
+ "max_tokens": 100
+ }
+ })
+
+ # Wait for completion
+ await job.wait()
+ print(job.output)
+
+asyncio.run(main())
+```
+
+
+
+## Complete example: vLLM inference
+
+This example deploys vLLM and makes inference requests:
+
+```python
+import asyncio
+from runpod_flash import Endpoint, GpuType
+
+# Configure vLLM endpoint
+vllm = Endpoint(
+ name="vllm-phi",
+ image="runpod/worker-vllm:stable-cuda12.1.0",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ workers=3,
+ env={
+ "MODEL_NAME": "microsoft/Phi-3.5-mini-instruct",
+ "MAX_MODEL_LEN": "4096",
+ "GPU_MEMORY_UTILIZATION": "0.9",
+ "MAX_CONCURRENCY": "30",
+ }
+)
+
+async def main():
+ # Generate text using queue-based call
+ job = await vllm.run({
+ "input": {
+ "prompt": "Explain quantum computing in simple terms:",
+ "max_tokens": 100,
+ "temperature": 0.7
+ }
+ })
+
+ await job.wait()
+
+ # Extract the generated text
+ text = job.output[0]['choices'][0]['tokens'][0]
+ print(f"Generated text: {text}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+## Configuration options
+
+All standard `Endpoint` parameters work with custom images:
+
+```python
+from runpod_flash import Endpoint, GpuType, NetworkVolume, PodTemplate
+
+vol = NetworkVolume(name="model-storage")
+
+vllm = Endpoint(
+ name="custom-vllm",
+ image="your-registry/image:tag",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ workers=(0, 5),
+ idle_timeout=600, # 10 minutes
+ env={
+ "MODEL_PATH": "/models/llama",
+ "MAX_BATCH_SIZE": "32"
+ },
+ volume=vol,
+ execution_timeout_ms=300000, # 5 minutes
+ template=PodTemplate(containerDiskInGb=100)
+)
+```
+
+### CPU endpoints
+
+For CPU workloads, use the `cpu` parameter:
+
+```python
+from runpod_flash import Endpoint
+
+cpu_worker = Endpoint(
+ name="cpu-worker",
+ image="your-registry/cpu-worker:latest",
+ cpu="cpu5c-4-8" # 4 vCPU, 8GB RAM
+)
+```
+
+## Request/response format
+
+### Queue-based requests
+
+Use `.run()` with a dictionary payload in the format `{"input": {...}}`:
+
+```python
+job = await endpoint.run({
+ "input": {
+ "param1": "value1",
+ "param2": "value2"
+ }
+})
+
+await job.wait()
+print(job.output) # Worker response
+print(job.error) # Error message if failed
+```
+
+### HTTP requests
+
+Use `.get()`, `.post()`, `.put()`, `.delete()` for direct HTTP calls:
+
+```python
+# POST request
+result = await endpoint.post("/v1/completions", {"prompt": "Hello"})
+
+# GET request
+models = await endpoint.get("/v1/models")
+
+# With custom headers
+result = await endpoint.post(
+ "/v1/completions",
+ {"prompt": "Hello"},
+ headers={"X-Custom-Header": "value"}
+)
+```
+
+## EndpointJob reference
+
+The `.run()` method returns an `EndpointJob` for async operations:
+
+```python
+job = await endpoint.run({"input": {...}})
+
+# Properties
+job.id # Job ID
+job.output # Result payload (after completion)
+job.error # Error message if failed
+job.done # True if completed/failed
+
+# Methods
+await job.status() # Get current status
+await job.wait(timeout=60) # Wait for completion
+await job.cancel() # Cancel the job
+```
+
+## Limitations
+
+- **Input format**: Queue-based calls require `{"input": {...}}` format.
+- **Code execution**: You can't execute arbitrary Python code remotely; your Docker image must include all the logic it needs.
+- **@Endpoint decorator**: The decorator pattern doesn't work with `image=`. Use the instance pattern instead.
+- **Handler required**: Your Docker image must implement a Runpod Serverless [handler function](/serverless/workers/handler-functions).
+
+## Troubleshooting
+
+### Endpoint fails to initialize
+
+**Problem**: Workers fail to start or crash immediately.
+
+**Solutions**:
+- Verify your Docker image is compatible with [Runpod Serverless](/serverless/overview).
+- Check environment variables are correct.
+- Ensure the image includes a valid handler function.
+- Check worker logs in the [Runpod console](https://www.runpod.io/console/serverless).
+
+### Out of memory errors
+
+**Problem**: Workers crash with CUDA OOM or RAM errors.
+
+**Solutions**:
+- Use a larger GPU: `gpu=GpuType.NVIDIA_A100_80GB_PCIe`
+- Reduce `GPU_MEMORY_UTILIZATION` for vLLM.
+- Lower `MAX_MODEL_LEN` or batch size.
+- Reduce `workers` to limit parallel execution.
+
+### Authentication errors
+
+**Problem**: Cannot download gated models or private images.
+
+**Solutions**:
+- Add `HF_TOKEN` to `env` for Hugging Face gated models.
+- Configure Docker registry authentication in [Runpod console](https://www.runpod.io/console/user/settings) for private images.
+
+## Next steps
+
+- [View all Endpoint parameters](/flash/configuration/parameters)
+- [Learn about vLLM deployment](/serverless/vllm/overview)
+- [Build custom Serverless workers](/serverless/workers/overview)
+- [Create Flash apps](/flash/apps/build-app)
diff --git a/flash/execution-model.mdx b/flash/execution-model.mdx
new file mode 100644
index 00000000..308f29ee
--- /dev/null
+++ b/flash/execution-model.mdx
@@ -0,0 +1,186 @@
+---
+title: "Execution model"
+sidebarTitle: "Execution model"
+description: "Understand how Flash executes your code on Runpod's infrastructure."
+---
+
+import { MachineTooltip } from "/snippets/tooltips.jsx";
+
+Flash runs your Python functions on remote GPU/CPU workers while you maintain local control flow. This page explains what happens when you call an `@Endpoint` function.
+
+## What runs where
+
+The `@Endpoint` decorator marks functions for remote execution. Everything else runs locally.
+
+```python
+import asyncio
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(name="demo", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
+def process_on_gpu(data):
+ # This runs on Runpod worker
+ import torch
+ return {"result": "processed"}
+
+async def main():
+ # This runs on your machine
+ result = await process_on_gpu({"input": "data"})
+ print(result) # This runs on your machine
+
+if __name__ == "__main__":
+ asyncio.run(main()) # This runs on your machine
+```
+
+| Code | Location |
+|------|----------|
+| `@Endpoint` decorator | Your machine (marks function) |
+| Inside `process_on_gpu` | Runpod worker |
+| Everything else | Your machine |
+
+### Flash apps
+
+When you build a [Flash app](/flash/apps/overview):
+
+**Development (`flash run`)**:
+- FastAPI server runs **locally**.
+- `@Endpoint` functions run on **Runpod workers**.
+
+**Production (`flash deploy`)**:
+- Each endpoint configuration becomes a **separate Serverless endpoint**.
+- All endpoints run on **Runpod**.
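+
+The two modes map directly to the CLI commands (shown here without project-specific flags):
+
+```bash
+# Development: FastAPI server runs locally; @Endpoint functions execute on Runpod workers
+flash run
+
+# Production: each endpoint configuration becomes its own Serverless endpoint on Runpod
+flash deploy
+```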
+
+## Execution flow
+
+Here's what happens when you call an `@Endpoint` function:
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+sequenceDiagram
+ participant Local as Your Machine
+ participant Flash as Flash SDK
+ participant Runpod as Runpod API
+ participant Worker as Remote Worker
+
+ Local->>Flash: Call remote function
+ Flash->>Flash: Look up endpoint by name
+ Flash->>Runpod: Check for existing endpoint
+
+ alt Endpoint exists
+ Runpod-->>Flash: Return endpoint ID
+ else New endpoint needed
+ Flash->>Runpod: Create endpoint
+ Runpod-->>Flash: Return endpoint ID
+ end
+
+ Flash->>Flash: Serialize function + args
+ Flash->>Runpod: Submit job
+ Runpod->>Worker: Route to worker
+
+ Worker->>Worker: Execute function
+ Worker->>Runpod: Return result
+
+ Runpod-->>Flash: Return result
+ Flash-->>Local: Return Python object
+```
+
+## Endpoint naming
+
+Flash identifies endpoints by their `name` parameter:
+
+```python
+@Endpoint(
+ name="inference", # This identifies the endpoint
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ workers=3
+)
+def run_inference(data): ...
+```
+
+- **Same name, same config**: Reuses the existing endpoint.
+- **Same name, different config**: Updates the endpoint automatically.
+- **New name**: Creates a new endpoint.
+
+This means you can change parameters like `workers` without creating a new endpoint—Flash detects the change and updates it.
+
+## Worker lifecycle
+
+Workers scale up and down based on demand and your configuration.
+
+### Worker states
+
+**Initializing**: The worker is starting up and downloading dependencies.
+
+**Idle**: The worker is ready but not processing requests.
+
+**Running**: The worker actively processes requests.
+
+**Throttled**: The worker is temporarily unable to run due to host resource constraints.
+
+**Outdated**: The system marks the worker for replacement after endpoint updates. It continues processing current jobs during rolling updates (10% of max workers at a time).
+
+**Unhealthy**: The worker has crashed due to Docker image issues, incorrect start commands, or machine problems. The system automatically retries with exponential backoff for up to 7 days.
+
+### Scaling behavior
+
+```python
+@Endpoint(
+ name="demo",
+ gpu=GpuGroup.ANY,
+ workers=(0, 5), # (min, max) - Scale to zero when idle, up to 5 workers
+ idle_timeout=60 # Seconds before idle workers scale down
+)
+def process(data): ...
+```
+
+**Example**:
+1. First job arrives → Scale to 1 worker (cold start).
+2. More jobs arrive while worker busy → Scale up to max workers.
+3. Jobs complete → Workers stay idle for `idle_timeout`.
+4. No new jobs → Scale down to min workers.
+
+## Cold starts and warm starts
+
+Understanding cold and warm starts helps you predict latency and set expectations.
+
+### Cold start
+
+A cold start occurs when no workers are available to handle your job:
+
+- You're calling an endpoint for the first time.
+- All workers scaled down after being idle beyond `idle_timeout`.
+- All active workers are busy and a new one must spin up.
+
+**What happens during a cold start**:
+1. Runpod provisions a new worker with your configured GPU/CPU.
+2. The worker image starts (dependencies are pre-installed during build).
+3. Your function executes.
+
+**Typical timing**: 10-60 seconds total, depending on GPU availability and image size.
+
+
+When using `flash build` or `flash deploy`, dependencies are pre-installed in the worker image, eliminating pip installation at request time. When running standalone scripts with `@Endpoint` functions outside of a Flash app, dependencies may be installed on the worker at request time.
+
+
+### Warm start
+
+A warm start occurs when a worker is already running and idle:
+
+- Worker completed a previous job and is waiting for more work.
+- Worker is within its `idle_timeout` period.
+
+**What happens during a warm start**:
+1. Job is routed immediately to the idle worker.
+2. Your function executes.
+
+**Typical timing**: ~1 second + your function's execution time.
+
+### The relationship between configuration and starts
+
+Your `workers` and `idle_timeout` settings directly affect cold start frequency:
+
+- `workers=(0, n)`: Workers scale to zero when idle. Every request after idle period triggers a cold start.
+- `workers=(1, n)`: At least one worker stays ready. The first concurrent request is warm; additional requests may cold start.
+- Higher `idle_timeout`: Workers stay idle longer before scaling down, reducing cold starts for sporadic traffic.
+
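+For example, a latency-sensitive endpoint might keep one worker warm and extend the idle window, at the cost of paying for more idle time (a minimal sketch):
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(
+    name="low-latency",
+    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+    workers=(1, 5),     # Keep at least one worker warm
+    idle_timeout=120    # Stay warm for 2 minutes between requests
+)
+def handle(request): ...
+```
+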
+See [configuration best practices](/flash/configuration/best-practices) for specific recommendations based on your workload.
diff --git a/flash/overview.mdx b/flash/overview.mdx
new file mode 100644
index 00000000..ba3fa401
--- /dev/null
+++ b/flash/overview.mdx
@@ -0,0 +1,124 @@
+---
+title: "Overview"
+sidebarTitle: "Overview"
+description: "Build autoscaling AI/ML apps using local code with Runpod Flash."
+tag: "BETA"
+mode: "wide"
+---
+
+import { ServerlessTooltip, PodsTooltip, WorkersTooltip, LoadBalancingEndpointsTooltip, QueueBasedEndpointsTooltip, EndpointsTooltip } from "/snippets/tooltips.jsx";
+
+
+
+
+Flash is currently in beta. [Join our Discord](https://discord.gg/cUpRmau42V) to provide feedback and get support.
+
+
+Flash is a Python SDK for developing cloud-native AI apps where you define everything—hardware, remote functions, and dependencies—using local code.
+
+```python
+import asyncio
+from runpod_flash import Endpoint, GpuType
+
+# Mark the function below for remote execution
+@Endpoint(name="hello-gpu", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, dependencies=["torch"])
+async def hello(): # This function runs on Runpod
+ import torch
+ gpu_name = torch.cuda.get_device_name(0)
+ print(f"Hello from your GPU! ({gpu_name})")
+ return {"gpu": gpu_name}
+
+asyncio.run(hello())
+print("Done!") # This runs locally
+```
+
+Write `@Endpoint` decorated Python functions on your local machine. Run them, and Flash automatically handles GPU/CPU provisioning and worker scaling on [Runpod Serverless](/serverless/overview).
+
+## Get started
+
+
+
+ Write a Flash script for instant access to Runpod GPUs.
+
+
+ Learn how to create endpoints of various types.
+
+
+ Browse example Flash scripts and apps on GitHub.
+
+
+
+## Setup
+
+### Install Flash
+
+
+
+Flash requires [Python 3.10, 3.11, or 3.12](https://www.python.org/downloads/) (Python 3.13+ is not yet supported), and is currently available for macOS and Linux.
+
+
+Install Flash using `pip` or `uv`:
+
+```bash
+# Install with pip
+pip install runpod-flash
+
+# Or uv
+uv add runpod-flash
+```
+
+### Authentication
+
+Before you can use Flash, you need to authenticate with your Runpod account:
+
+```bash
+flash login
+```
+
+This saves your API key securely and allows you to use the Flash CLI and run `@Endpoint` functions.
+
+### Coding agent integration (optional)
+
+Install the Flash skill package for AI coding agents like Claude Code, Cline, and Cursor:
+
+```bash
+npx skills add runpod/skills
+```
+
+You can review the `SKILL.md` file in the [runpod/skills repository](https://github.com/runpod/skills/blob/main/flash/SKILL.md).
+
+## Flash apps
+
+When you're ready to move beyond scripts and build a production-ready API, you can create a [Flash app](/flash/apps/overview) (a collection of interconnected endpoints with diverse hardware configurations) and deploy it to Runpod.
+
+[Follow this tutorial to build your first Flash app](/flash/apps/build-app).
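+
+For example, an app might pair a CPU endpoint for lightweight API routes with a GPU endpoint for inference. This sketch shows only the endpoint definitions; the tutorial above covers the full app structure and deployment flow:
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+# Lightweight CPU endpoint for API routes (load-balanced)
+api = Endpoint(name="api", cpu="cpu5c-2-4", workers=(1, 3))
+
+@api.get("/health")
+async def health_check() -> dict:
+    return {"status": "healthy"}
+
+# GPU endpoint for heavier inference work (queue-based)
+@Endpoint(name="inference", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090, workers=(0, 3))
+def run_inference(data): ...
+```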
+
+## Flash CLI
+
+The Flash CLI provides a set of commands for managing your Flash apps and endpoints.
+
+```bash
+flash --help
+```
+
+[Learn more about the Flash CLI](/flash/cli/overview).
+
+## Limitations
+
+- Flash is currently only available for macOS and Linux. Windows support is in development.
+- Serverless deployments using Flash are currently restricted to the `EU-RO-1` datacenter.
+- Flash can rapidly scale workers across multiple endpoints, and you may hit your maximum worker threshold quickly. Contact [Runpod support](https://www.runpod.io/contact) to increase your account's capacity if needed.
+
+## Tutorials
+
+
+
+ Build a GPU-accelerated image generation service.
+
+
+ Deploy a text generation model on Runpod.
+
+
+ Create HTTP endpoints with load balancing.
+
+
\ No newline at end of file
diff --git a/flash/pricing.mdx b/flash/pricing.mdx
new file mode 100644
index 00000000..028da5cf
--- /dev/null
+++ b/flash/pricing.mdx
@@ -0,0 +1,118 @@
+---
+title: "Pricing"
+sidebarTitle: "Pricing"
+description: "Understand Flash pricing and optimize your costs."
+---
+
+Flash follows the same pricing model as [Runpod Serverless](/serverless/pricing). You pay per second of compute time, with no charges when your code isn't running. Pricing depends on the GPU or CPU type you configure for your endpoints.
+
+## How pricing works
+
+You're billed from when a worker starts until it completes your request, plus any idle time before scaling down. If a worker is already warm, you skip the cold start and only pay for execution time.
+
+### Compute cost breakdown
+
+Flash workers incur charges during these periods:
+
+1. **Start time**: The time required to initialize a worker and load models into GPU memory. This includes starting the container, installing dependencies, and preparing the runtime environment.
+2. **Execution time**: The time spent processing your request (running your `@Endpoint` decorated function).
+3. **Idle time**: The period a worker remains active after completing a request, waiting for additional requests before scaling down.
+
+### Pricing by resource type
+
+Flash supports both GPU and CPU workers. Pricing varies based on the hardware type:
+
+- **GPU workers**: Use `@Endpoint(gpu=...)` configuration. Pricing depends on the GPU type (e.g., RTX 4090, A100 80GB).
+- **CPU workers**: Use `@Endpoint(cpu=...)` configuration. Pricing depends on the CPU instance type.
+
+See the [Serverless pricing page](/serverless/pricing) for current rates by GPU and CPU type.
+
+## How to estimate and optimize costs
+
+To estimate costs for your Flash workloads, consider:
+
+- How long each function takes to execute.
+- How many concurrent workers you need (`workers` setting).
+- Which GPU or CPU types you'll use.
+- Your idle timeout configuration (`idle_timeout` setting).
+
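+Putting these factors together, a rough back-of-the-envelope estimate might look like the sketch below. The per-second rate is a placeholder; check the [Serverless pricing page](/serverless/pricing) for current rates.
+
+```python
+# Rough monthly cost estimate for one endpoint (all numbers are placeholders)
+requests_per_day = 2000
+avg_billed_seconds = 5.0      # Start + execution time per request, averaged
+idle_seconds_per_day = 3600   # Estimated total idle time across all workers per day
+price_per_second = 0.00031    # Placeholder GPU rate; see the Serverless pricing page
+
+daily_cost = (requests_per_day * avg_billed_seconds + idle_seconds_per_day) * price_per_second
+monthly_cost = daily_cost * 30
+
+print(f"Estimated monthly cost: ${monthly_cost:.2f}")
+```
+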
+### Cost optimization strategies
+
+#### Choose appropriate hardware
+
+Select the smallest GPU or CPU that meets your performance requirements. For example, if your workload fits in 24GB of VRAM, use an RTX 4090 or L4 instead of larger GPUs like the A100.
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+# Cost-effective configuration for workloads that fit in 24GB VRAM
+@Endpoint(
+ name="cost-optimized",
+ gpu=[GpuType.NVIDIA_GEFORCE_RTX_4090, GpuType.NVIDIA_L4]
+)
+def process(data): ...
+```
+
+#### Configure idle timeouts
+
+Balance responsiveness and cost by adjusting the `idle_timeout` parameter. Shorter timeouts reduce idle costs but increase cold starts for sporadic traffic.
+
+```python
+# Lower idle timeout for cost savings (more cold starts)
+@Endpoint(
+ name="low-idle",
+ gpu=GpuGroup.ANY,
+ idle_timeout=5 # 5 seconds
+)
+def process(data): ...
+
+# Higher idle timeout for responsiveness (higher idle costs)
+@Endpoint(
+ name="responsive",
+ gpu=GpuGroup.ANY,
+ idle_timeout=30 # 30 seconds
+)
+def process(data): ...
+```
+
+#### Use CPU workers for non-GPU tasks
+
+For data preprocessing, postprocessing, or other tasks that don't require GPU acceleration, use CPU workers instead of GPU workers.
+
+```python
+from runpod_flash import Endpoint
+
+# CPU configuration for non-GPU tasks
+@Endpoint(
+ name="data-processor",
+ cpu="cpu5c-2-4" # 2 vCPU, 4GB RAM
+)
+def process_data(data): ...
+```
+
+#### Limit maximum workers
+
+Set `workers` to prevent runaway scaling and unexpected costs:
+
+```python
+@Endpoint(
+ name="controlled-scaling",
+ gpu=GpuGroup.ANY,
+ workers=3 # Limit to 3 concurrent workers (same as workers=(0, 3))
+)
+def process(data): ...
+```
+
+### Monitoring costs
+
+Monitor your usage in the [Runpod console](https://www.runpod.io/console/serverless) to track:
+
+- Total compute time across endpoints.
+- Worker utilization and idle time.
+- Cost breakdown by endpoint.
+
+## Next steps
+
+- [Create endpoint functions](/flash/create-endpoints) with optimized configurations.
+- [View Serverless pricing details](/serverless/pricing) for current rates.
+- [Configure resources](/flash/configuration/parameters) for your workloads.
diff --git a/flash/quickstart.mdx b/flash/quickstart.mdx
new file mode 100644
index 00000000..e03a09a2
--- /dev/null
+++ b/flash/quickstart.mdx
@@ -0,0 +1,391 @@
+---
+title: "Get started with Flash"
+sidebarTitle: "Quickstart"
+description: "Run your first GPU workload with Flash in less than 5 minutes."
+---
+
+
+Flash is currently in beta. [Join our Discord](https://discord.gg/cUpRmau42V) to provide feedback and get support.
+
+
+This quickstart gets you running GPU workloads on Runpod in minutes. You'll execute a function on a remote GPU and see the results immediately.
+
+## Requirements
+
+- [Runpod account](/get-started/manage-accounts).
+- [An API key](/get-started/api-keys) with **All** access permissions to your Runpod account.
+- [Python 3.10, 3.11, or 3.12](https://www.python.org/downloads/) installed (Python 3.13+ is not yet supported).
+
+## Step 1: Install Flash
+
+
+Flash is currently available for macOS and Linux. Windows support is in development.
+
+
+Create a virtual environment and install Flash using [uv](https://docs.astral.sh/uv/):
+
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install runpod-flash
+```
+
+### Optional: Install coding agent integration
+
+If you're using an AI coding agent like Claude Code, Cline, or Cursor, you can install the Flash skill package to give your agent detailed context about the Flash SDK:
+
+```bash
+npx skills add runpod/skills
+```
+
+This enables your coding agent to provide more accurate Flash code suggestions and troubleshooting help.
+
+## Step 2: Authenticate with Runpod
+
+Log in to your Runpod account:
+
+```bash
+flash login
+```
+
+This opens your browser to authorize Flash. After you approve, your credentials are saved, allowing you to run Flash commands and scripts.
+
+
+Alternatively, you can set the `RUNPOD_API_KEY` environment variable or add it to a `.env` file. See [`flash login`](/flash/cli/login) for details.
+
+
+## Step 3: Copy this code
+
+Create a file called `gpu_demo.py` and paste this code into it:
+
+```python
+import asyncio
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(
+ name="flash-quickstart",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ workers=3,
+ dependencies=["numpy", "torch"]
+)
+def gpu_matrix_multiply(size):
+ # IMPORTANT: Import packages INSIDE the function
+ import numpy as np
+ import torch
+
+ # Get GPU name
+ device_name = torch.cuda.get_device_name(0)
+
+ # Create random matrices
+ A = np.random.rand(size, size)
+ B = np.random.rand(size, size)
+
+ # Multiply matrices
+ C = np.dot(A, B)
+
+ return {
+ "matrix_size": size,
+ "result_mean": float(np.mean(C)),
+ "gpu": device_name
+ }
+
+# Call the function
+async def main():
+ print("Running matrix multiplication on Runpod GPU...")
+ result = await gpu_matrix_multiply(1000)
+
+ print(f"\n✓ Matrix size: {result['matrix_size']}x{result['matrix_size']}")
+ print(f"✓ Result mean: {result['result_mean']:.4f}")
+ print(f"✓ GPU used: {result['gpu']}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+## Step 4: Run it
+
+Execute the script:
+
+```bash
+python gpu_demo.py
+```
+
+You'll see Flash provision a GPU worker and execute your function:
+
+```text
+Running matrix multiplication on Runpod GPU...
+Creating endpoint: flash-quickstart
+Provisioning Serverless endpoint...
+Endpoint ready
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Job completed, output received
+
+✓ Matrix size: 1000x1000
+✓ Result mean: 249.8286
+✓ GPU used: NVIDIA GeForce RTX 4090
+```
+
+The first run takes 30-60 seconds while Runpod provisions the endpoint, installs dependencies, and starts a worker. Subsequent runs take 2-3 seconds, because the worker is already running.
+
+
+Try running the script again immediately and notice how much faster it is. Flash reuses the same endpoint and cached dependencies. You can even update the code and run it again to see the changes take effect instantly.
+
+
+## Step 5: Understand what you just did
+
+Let's break down the code you just ran:
+
+### Imports and setup
+
+```python
+import asyncio
+from runpod_flash import Endpoint, GpuType
+```
+
+- **`asyncio`**: Enables asynchronous execution (endpoint functions run async).
+- **`Endpoint`**: The class that marks functions for remote execution.
+- **`GpuType`**: Enum for selecting specific GPU types.
+
+Flash automatically loads your credentials from `flash login` or the `RUNPOD_API_KEY` environment variable.
+
+### The `@Endpoint` decorator
+
+```python
+@Endpoint(
+ name="flash-quickstart",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ workers=3,
+ dependencies=["numpy", "torch"]
+)
+def gpu_matrix_multiply(size):
+ import numpy as np
+ import torch
+
+ # Get GPU name
+ device_name = torch.cuda.get_device_name(0)
+
+ # Create random matrices
+ A = np.random.rand(size, size)
+ B = np.random.rand(size, size)
+
+ # Multiply matrices
+ C = np.dot(A, B)
+
+ return {
+ "matrix_size": size,
+ "result_mean": float(np.mean(C)),
+ "gpu": device_name
+ }
+```
+
+The `@Endpoint` decorator configures everything in one place:
+
+- **`name`**: Identifies your endpoint in the [Runpod console](https://www.runpod.io/console/serverless).
+- **`gpu`**: Which GPU type to use (here: RTX 4090 with 24GB VRAM).
+- **`workers`**: Maximum parallel workers (allows 3 concurrent executions).
+- **`dependencies`**: Python packages to install on the worker.
+- **Function body**: The matrix multiplication code runs on the remote GPU, not your local machine.
+- **Return value**: The result is returned to your local machine as a Python dictionary.
+
+See [GPU types](/flash/configuration/gpu-types) for available GPUs or [endpoint functions](/flash/create-endpoints) for all configuration options.
+
+
+You must import packages **inside the function body**, not at the top of your file. These imports need to happen on the remote worker.
+
+
+### Calling the function
+
+```python
+async def main():
+ print("Running matrix multiplication on Runpod GPU...")
+ result = await gpu_matrix_multiply(1000)
+
+ print(f"\n✓ Matrix size: {result['matrix_size']}x{result['matrix_size']}")
+ print(f"✓ Result mean: {result['result_mean']:.4f}")
+ print(f"✓ GPU used: {result['gpu']}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+Here's what happens when you call an `@Endpoint` decorated function:
+
+1. Flash checks if the endpoint specified in your decorator already exists.
+ - If yes: It updates the endpoint if the configuration has changed.
+ - If no: It creates a new endpoint, initializes a worker, and installs your dependencies.
+2. Flash sends your code to the GPU worker.
+3. The GPU worker executes the function with the provided inputs.
+4. The result is returned to your local machine as a Python dictionary, where it's printed in your terminal.
+
+Everything outside the `@Endpoint` function (all the `print` statements, etc.) runs **locally on your machine**. Only the decorated function runs remotely.
+
+## Step 6: Run multiple operations in parallel
+
+Flash makes it easy to run multiple GPU operations concurrently. Replace your `main()` function with the code below:
+
+```python
+async def main():
+ print("Running 3 matrix operations in parallel...")
+
+ # Run all three operations at once
+ results = await asyncio.gather(
+ gpu_matrix_multiply(500),
+ gpu_matrix_multiply(1000),
+ gpu_matrix_multiply(2000)
+ )
+
+ # Print results
+ for i, result in enumerate(results, 1):
+ print(f"\n{i}. Size: {result['matrix_size']}x{result['matrix_size']}")
+ print(f" Mean: {result['result_mean']:.4f}")
+ print(f" GPU: {result['gpu']}")
+```
+
+Run the script again:
+
+```bash
+python gpu_demo.py
+```
+
+All three operations execute simultaneously:
+
+```text
+Running 3 matrix operations in parallel...
+Initial job status: IN_QUEUE
+Initial job status: IN_QUEUE
+Initial job status: IN_QUEUE
+Job completed, output received
+Job completed, output received
+Job completed, output received
+
+1. Size: 500x500
+ Mean: 125.3097
+ GPU: NVIDIA GeForce RTX 4090
+
+2. Size: 1000x1000
+ Mean: 249.9442
+ GPU: NVIDIA GeForce RTX 4090
+
+3. Size: 2000x2000
+ Mean: 500.1321
+ GPU: NVIDIA GeForce RTX 4090
+```
+
+## Clean up
+
+When you're done testing, clean up the endpoints:
+
+```bash
+# List all endpoints
+flash undeploy list
+
+# Remove the quickstart endpoint
+flash undeploy flash-quickstart
+
+# Or remove all endpoints
+flash undeploy --all
+```
+
+## Next steps
+
+You've successfully run GPU code on Runpod! Now you're ready to learn more about Flash:
+
+
+
+ Use Stable Diffusion XL to generate images from text prompts.
+
+
+ Learn how to configure and optimize endpoint functions.
+
+
+ Deploy production APIs.
+
+
+ Browse example Flash scripts and apps on GitHub.
+
+
+
+## Troubleshooting
+
+### Authentication error
+
+```text
+Error: API key is not set
+```
+
+**Solution**: Run `flash login` to authenticate with your Runpod account:
+
+```bash
+flash login
+```
+
+Alternatively, set the `RUNPOD_API_KEY` environment variable:
+
+```bash
+export RUNPOD_API_KEY=your_key
+```
+
+### Template name conflict
+
+```text
+Error: endpoint template names must be unique
+```
+
+**Solution**: Each endpoint needs a unique `name`. If you've previously deployed an endpoint with the same name, either:
+- Use a different name for your new endpoint.
+- Undeploy the existing endpoint with `flash undeploy --force`.
+
+### Job stuck in queue
+
+```text
+Initial job status: IN_QUEUE
+[Stays in queue for >60 seconds]
+```
+
+**Solution**: The requested GPU type isn't currently available. Use `GpuGroup.ANY` to accept any available GPU:
+
+```python
+@Endpoint(
+ name="flash-quickstart",
+ gpu=GpuGroup.ANY,
+ dependencies=["numpy", "torch"]
+)
+def gpu_matrix_multiply(size):
+ ...
+```
+
+Or add multiple specific GPU types for fallback:
+
+```python
+@Endpoint(
+ name="flash-quickstart",
+ gpu=[
+ GpuType.NVIDIA_GEFORCE_RTX_4090,
+ GpuType.NVIDIA_RTX_A5000,
+ GpuType.NVIDIA_RTX_A6000
+ ],
+ dependencies=["numpy", "torch"]
+)
+def gpu_matrix_multiply(size):
+ ...
+```
+
+You can also check [GPU availability](https://www.runpod.io/console/serverless) in the console.
+
+### Import errors
+
+```text
+ModuleNotFoundError: No module named 'numpy'
+```
+
+**Solution**: Move imports inside the `@Endpoint` function:
+
+```python
+@Endpoint(name="compute", gpu=GpuGroup.ANY, dependencies=["numpy"])
+def my_function():
+ import numpy as np # Import here, not at top of file
+ # ...
+```
+
+See the [execution model](/flash/execution-model#common-execution-issues) for more troubleshooting.
diff --git a/flash/troubleshooting.mdx b/flash/troubleshooting.mdx
new file mode 100644
index 00000000..1fe928d4
--- /dev/null
+++ b/flash/troubleshooting.mdx
@@ -0,0 +1,429 @@
+---
+title: "Troubleshooting"
+sidebarTitle: "Troubleshooting"
+description: "Monitor, debug, and troubleshoot Flash deployments."
+---
+
+This guide covers how to monitor your Flash deployments, debug issues, and resolve common errors.
+
+## Monitoring and debugging
+
+### Viewing logs
+
+When running Flash functions, logs are displayed in your terminal:
+
+```text
+2025-11-19 12:35:15,109 | INFO | Created endpoint: rb50waqznmn2kg - flash-quickstart-fb
+2025-11-19 12:35:15,114 | INFO | Endpoint:rb50waqznmn2kg | API /run
+2025-11-19 12:35:15,655 | INFO | Endpoint:rb50waqznmn2kg | Started Job:b0b341e7-...
+2025-11-19 12:35:15,762 | INFO | Job:b0b341e7-... | Status: IN_QUEUE
+2025-11-19 12:36:09,983 | INFO | Job:b0b341e7-... | Status: COMPLETED
+2025-11-19 12:36:10,068 | INFO | Worker:icmkdgnrmdf8gz | Delay Time: 51842 ms
+2025-11-19 12:36:10,068 | INFO | Worker:icmkdgnrmdf8gz | Execution Time: 1533 ms
+```
+
+Control log verbosity with the `LOG_LEVEL` environment variable:
+
+```bash
+LOG_LEVEL=DEBUG python your_script.py
+```
+
+Available levels: `DEBUG`, `INFO`, `WARNING`, `ERROR`.
+
+### Runpod console
+
+View detailed metrics and logs in the [Runpod console](https://www.runpod.io/console/serverless):
+
+1. Navigate to the **Serverless** section.
+2. Click on your endpoint to view:
+ - Active workers and queue depth.
+ - Request history and job status.
+ - Worker logs and execution details.
+
+The console provides metrics including request rate, queue depth, latency, worker count, and error rate.
+
+### View worker logs
+
+Access detailed logs for specific workers:
+
+1. Go to the [Serverless console](https://www.runpod.io/console/serverless).
+2. Select your endpoint.
+3. Click on a worker to view its logs.
+
+Logs include dependency installation output, function execution output (print statements, errors), and system-level messages.
+
+### Add logging to functions
+
+Include print statements in your endpoint functions for debugging:
+
+```python
+@Endpoint(name="processor", gpu=GpuGroup.ANY)
+async def process(data: dict) -> dict:
+ print(f"Received data: {data}") # Visible in worker logs
+
+ result = do_processing(data)
+ print(f"Processing complete: {result}")
+
+ return result
+```
+
+## Configuration errors
+
+### API key not set
+
+**Error:**
+```
+RUNPOD_API_KEY environment variable is required but not set
+```
+
+**Cause:** Flash requires a valid Runpod API key to provision and manage endpoints.
+
+**Solution:**
+
+1. Generate an API key from [Settings > API Keys](https://www.runpod.io/console/user/settings) in the Runpod console. The key needs **All** access permissions.
+
+2. Set the key using one of these methods:
+
+ **Option 1: Environment variable**
+ ```bash
+ export RUNPOD_API_KEY=your_api_key
+ ```
+
+ **Option 2: .env file in your project root**
+ ```bash
+ echo "RUNPOD_API_KEY=your_api_key" > .env
+ ```
+
+ **Option 3: Shell profile (~/.bashrc or ~/.zshrc)**
+ ```bash
+ echo 'export RUNPOD_API_KEY=your_api_key' >> ~/.bashrc
+ source ~/.bashrc
+ ```
+
+### Invalid route configuration
+
+**Error:**
+```
+Load-balanced endpoints require route decorators
+```
+
+**Cause:** Load-balanced endpoints require HTTP method decorators for each route.
+
+**Solution:** Ensure all routes use the correct decorator pattern:
+
+```python
+from runpod_flash import Endpoint
+
+api = Endpoint(name="api", cpu="cpu5c-4-8", workers=(1, 5))
+
+# Correct - using route decorators
+@api.post("/process")
+async def process_data(data: dict) -> dict:
+ return {"result": "processed"}
+
+@api.get("/health")
+async def health_check() -> dict:
+ return {"status": "healthy"}
+```
+
+### Invalid HTTP method
+
+**Error:**
+```
+method must be one of {'GET', 'POST', 'PUT', 'DELETE', 'PATCH'}
+```
+
+**Cause:** The HTTP method specified is not supported.
+
+**Solution:** Use one of the supported HTTP methods: `GET`, `POST`, `PUT`, `DELETE`, or `PATCH`.
+
+### Invalid path format
+
+**Error:**
+```
+path must start with '/'
+```
+
+**Cause:** HTTP paths must begin with a forward slash.
+
+**Solution:** Ensure paths start with `/`:
+
+```python
+# Correct
+@api.get("/health")
+
+# Incorrect
+@api.get("health")
+```
+
+### Duplicate routes
+
+**Error:**
+```
+Duplicate route 'POST /process' in endpoint 'my-api'
+```
+
+**Cause:** Two functions define the same HTTP method and path combination.
+
+**Solution:** Ensure each route is unique within an endpoint. Either change the path or method of one function.
+
+## Deployment errors
+
+### Tarball too large
+
+**Error:**
+```
+Tarball exceeds maximum size. File size: 512.5MB, Max: 500MB
+```
+
+**Cause:** The deployment package exceeds the 500MB limit.
+
+**Solution:**
+
+1. Check for large files that shouldn't be included (datasets, model weights, logs).
+2. Add large files to `.flashignore` to exclude them from the build.
+3. Use [network volumes](/flash/configuration/storage) to store large models instead of bundling them.
+
+### Invalid tarball format
+
+**Error:**
+```
+File is not a valid gzip file. Expected magic bytes (31, 139)
+```
+
+**Cause:** The build artifact is corrupted or not a valid gzip file.
+
+**Solution:** Delete the `.flash` directory and rebuild:
+
+```bash
+rm -rf .flash
+flash build
+```
+
+### Resource provisioning failed
+
+**Error:**
+```
+Failed to provision resources: [error details]
+```
+
+**Cause:** Flash couldn't create the Serverless endpoint on Runpod.
+
+**Solutions:**
+
+1. **Check GPU availability**: The requested GPU types may not be available. Add fallback options:
+ ```python
+ gpu=[GpuType.NVIDIA_A100_80GB_PCIe, GpuType.NVIDIA_RTX_A6000, GpuType.NVIDIA_GEFORCE_RTX_4090]
+ ```
+
+2. **Check account limits**: You may have hit worker capacity limits. Contact [Runpod support](https://www.runpod.io/contact) to increase limits.
+
+3. **Check network volume**: If using `volume=`, verify the volume exists and is in a compatible datacenter.
+
+## Runtime errors
+
+### Endpoint not deployed
+
+**Error:**
+```
+Endpoint URL not available - endpoint may not be deployed
+```
+
+**Cause:** The endpoint function was called before the endpoint finished provisioning.
+
+**Solutions:**
+
+1. **For standalone scripts**: Ensure the endpoint has time to provision. Flash handles this automatically, but network issues can cause delays.
+
+2. **For Flash apps**: Deploy the app first with `flash deploy`, then call the endpoint.
+
+3. **Check endpoint status**: View your endpoints in the [Serverless console](https://www.runpod.io/console/serverless).
+
+### Execution timeout
+
+**Error:**
+```
+Execution timeout on [endpoint] after [N]s
+```
+
+**Cause:** The endpoint function took longer than the configured timeout.
+
+**Solutions:**
+
+1. **Increase timeout**: Set `execution_timeout_ms` in your configuration:
+ ```python
+ @Endpoint(
+ name="long-running",
+ gpu=GpuType.NVIDIA_A100_80GB_PCIe,
+ execution_timeout_ms=600000 # 10 minutes
+ )
+ ```
+
+2. **Optimize function**: Profile your function to identify bottlenecks.
+
+3. **Use queue-based endpoints**: For long-running tasks, use the `@Endpoint` decorator pattern. Queue-based endpoints are designed for longer operations.
+
+### Connection failed
+
+**Error:**
+```
+Failed to connect to endpoint [name] ([url])
+```
+
+**Cause:** Network connectivity issue between your local environment and the Runpod endpoint.
+
+**Solutions:**
+
+1. **Check internet connection**: Verify you have network access.
+2. **Retry**: Transient network issues often resolve on retry. Flash includes automatic retry logic.
+3. **Check endpoint status**: Verify the endpoint is running in the [Serverless console](https://www.runpod.io/console/serverless).
+
+### HTTP errors from endpoint
+
+**Error:**
+```
+HTTP error from endpoint [name]: 500 - Internal Server Error
+```
+
+**Cause:** The endpoint function raised an exception during execution.
+
+**Solutions:**
+
+1. **Check logs**: View worker logs in the [Serverless console](https://www.runpod.io/console/serverless) for detailed error messages.
+
+2. **Test locally**: Use `flash run` to test your function locally before deploying.
+
+3. **Add error handling**: Wrap your function logic in try/except to provide better error messages:
+ ```python
+ @Endpoint(name="processor", gpu=GpuGroup.ANY)
+ async def process(data: dict) -> dict:
+ try:
+ # Your logic here
+ return {"result": "success"}
+ except Exception as e:
+ return {"error": str(e)}
+ ```
+
+### Serialization errors
+
+**Error:**
+```
+Failed to deserialize result: [error]
+```
+
+**Cause:** The function's return value cannot be serialized/deserialized.
+
+**Solutions:**
+
+1. **Use simple types**: Return dictionaries, lists, strings, numbers, and other JSON-serializable types.
+
+2. **Avoid complex objects**: Don't return PyTorch tensors, NumPy arrays, or custom classes directly. Convert them first:
+ ```python
+ # Correct
+ return {"result": tensor.tolist()}
+
+ # Incorrect - tensor is not serializable
+ return {"result": tensor}
+ ```
+
+3. **Check argument types**: Input arguments must also be serializable.
+
+### Circuit breaker open
+
+**Error:**
+```
+Circuit breaker is open. Retry in [N] seconds
+```
+
+**Cause:** Too many consecutive failed requests to the endpoint triggered the circuit breaker protection.
+
+**Solutions:**
+
+1. **Wait and retry**: The circuit breaker automatically attempts recovery after the timeout (typically 60 seconds). A simple retry sketch follows this list.
+
+2. **Check endpoint health**: Multiple failures usually indicate an underlying issue. Check logs and endpoint status.
+
+3. **Fix the root cause**: Address whatever is causing the repeated failures before retrying.
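+
+As a minimal sketch, you can wrap a call in a retry loop that backs off while the circuit breaker recovers. Flash's specific exception types aren't shown here, so this catches a generic `Exception`; adjust it to the error you actually observe, and substitute your own `@Endpoint` function for the hypothetical `process` in the usage comment:
+
+```python
+import asyncio
+
+async def call_with_retry(fn, *args, retries=3, wait_seconds=60):
+    """Retry an endpoint call, waiting between attempts."""
+    for attempt in range(1, retries + 1):
+        try:
+            return await fn(*args)
+        except Exception as e:  # Replace with the specific error you observe
+            if attempt == retries:
+                raise
+            print(f"Attempt {attempt} failed ({e}); retrying in {wait_seconds}s...")
+            await asyncio.sleep(wait_seconds)
+
+# Example usage with an @Endpoint function named `process` (hypothetical):
+# result = await call_with_retry(process, {"input": "data"})
+```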
+
+## GPU availability issues
+
+### Job stuck in queue
+
+**Symptom:** Job status shows `IN_QUEUE` for extended periods.
+
+**Cause:** The requested GPU types are not available.
+
+**Solutions:**
+
+1. **Add fallback GPUs**: Expand your `gpu` list with additional options:
+ ```python
+ @Endpoint(
+ name="flexible",
+ gpu=[
+ GpuType.NVIDIA_A100_80GB_PCIe, # First choice
+ GpuType.NVIDIA_RTX_A6000, # Fallback
+ GpuType.NVIDIA_GEFORCE_RTX_4090 # Second fallback
+ ]
+   )
+   def process(data): ...
+ ```
+
+2. **Use GpuGroup.ANY**: For development, accept any available GPU:
+ ```python
+ gpu=GpuGroup.ANY
+ ```
+
+3. **Check availability**: View GPU availability in the [Serverless console](https://www.runpod.io/console/serverless).
+
+4. **Contact support**: For guaranteed capacity, contact [Runpod support](https://www.runpod.io/contact).
+
+## Dependency errors
+
+### Module not found
+
+**Error (in worker logs):**
+```
+ModuleNotFoundError: No module named 'transformers'
+```
+
+**Cause:** A required dependency was not specified in the `@Endpoint` decorator.
+
+**Solution:** Add all required packages to the `dependencies` parameter:
+
+```python
+@Endpoint(
+ name="processor",
+ gpu=GpuGroup.ANY,
+ dependencies=["transformers", "torch", "pillow"]
+)
+async def process(data: dict) -> dict:
+ from transformers import pipeline
+ # ...
+```
+
+### Version conflicts
+
+**Symptom:** Function fails with import errors or unexpected behavior.
+
+**Cause:** Dependency version conflicts between packages.
+
+**Solution:** Pin specific versions:
+
+```python
+@Endpoint(
+ name="processor",
+ gpu=GpuGroup.ANY,
+ dependencies=[
+ "transformers==4.36.0",
+ "torch==2.1.0",
+ "accelerate>=0.25.0"
+ ]
+)
+async def process(data: dict) -> dict: ...
+```
+
+## Getting help
+
+If you're still stuck:
+
+1. **Discord**: Join the [Runpod Discord](https://discord.gg/cUpRmau42V) for community support.
+2. **GitHub Issues**: Report bugs or request features on the [Flash repository](https://github.com/runpod/flash).
+3. **Support**: Contact [Runpod support](https://www.runpod.io/contact) for account-specific issues.
diff --git a/get-started.mdx b/get-started.mdx
index e81fb20d..fb7575ff 100644
--- a/get-started.mdx
+++ b/get-started.mdx
@@ -88,12 +88,26 @@ To learn more about how storage works, see the [Pod storage overview](/pods/stor
Now that you've learned the basics, you're ready to:
-* [Generate API keys](/get-started/api-keys) for programmatic resource management.
-* [Manage your account](/get-started/manage-accounts) to create teams and invite collaborators.
-* Learn how to [choose the right Pod](/pods/choose-a-pod) for your workload.
-* Review options for [Pod pricing](/pods/pricing).
-* [Explore our tutorials](/tutorials/introduction/overview) for specific AI/ML use cases.
-* Start building production-ready applications with [Runpod Serverless](/serverless/overview).
+
+
+ Create API keys for programmatic resource management.
+
+
+ Create teams and invite collaborators.
+
+
+ Learn how to select the best Pod for your workload.
+
+
+ Review pricing options for Pods.
+
+
+ Follow step-by-step guides for specific AI/ML use cases.
+
+
+ Start building production-ready applications.
+
+
## Need help?
diff --git a/get-started/concepts.mdx b/get-started/concepts.mdx
index cbf66686..af6e1bf5 100644
--- a/get-started/concepts.mdx
+++ b/get-started/concepts.mdx
@@ -11,6 +11,10 @@ The web interface for managing your compute resources, account, teams, and billi
A pay-as-you-go compute solution designed for dynamic autoscaling in production AI/ML apps.
+## [Flash](/flash/overview)
+
+A Python SDK for building distributed GPU applications using local Python scripts. Write functions with the `@Endpoint` decorator, and Flash automatically executes them on Runpod's infrastructure.
+
## [Pod](/pods/overview)
A dedicated GPU or CPU instance for containerized AI/ML workloads, such as training models, running inference, or other compute-intensive tasks.
diff --git a/get-started/manage-accounts.mdx b/get-started/manage-accounts.mdx
index 37e76386..5f409efb 100644
--- a/get-started/manage-accounts.mdx
+++ b/get-started/manage-accounts.mdx
@@ -120,12 +120,3 @@ When managing team accounts, establish clear role assignments based on each memb
For enhanced security, use the principle of least privilege by assigning the minimum role necessary for each team member's work. Consider creating separate accounts for billing management to isolate financial access from technical operations.
Monitor audit logs periodically to ensure compliance with your organization's policies and identify any unauthorized activities early.
-
-## Next steps
-
-After setting up your account and team you can:
-
-* [Create API keys](/get-started/api-keys) to enable programmatic access to Runpod services.
-* [Deploy your first Pod](/get-started) to start using GPU resources.
-* Configure [Serverless endpoints](/serverless/overview) for scalable AI .
-* Set up [billing and payment methods](https://console.runpod.io/user/billing) for your team.
\ No newline at end of file
diff --git a/images/flash_sdxl_output.png b/images/flash_sdxl_output.png
new file mode 100644
index 00000000..07dbbe29
Binary files /dev/null and b/images/flash_sdxl_output.png differ
diff --git a/instant-clusters.mdx b/instant-clusters.mdx
index 1e7e101b..960fa955 100644
--- a/instant-clusters.mdx
+++ b/instant-clusters.mdx
@@ -2,10 +2,13 @@
title: "Overview"
sidebarTitle: "Overview"
description: "Fully managed compute clusters for multi-node training and AI inference."
+mode: "wide"
---
import { DataCenterTooltip, PyTorchTooltip, TrainingTooltip, InferenceTooltip, SlurmTooltip, TensorFlowTooltip } from "/snippets/tooltips.jsx";
+
+
Runpod Instant Clusters provide fully managed compute clusters with high-performance networking for distributed workloads. Deploy multi-node jobs or large-scale AI without managing infrastructure, networking, or cluster configuration.
## Why use Instant Clusters?
@@ -29,32 +32,18 @@ Instant Clusters offer distributed computing power beyond the capabilities of si
Choose the deployment guide that matches your preferred framework and use case:
-
-
- Set up a managed Slurm cluster for high-performance computing workloads. Slurm provides job scheduling, resource allocation, and queue management for research environments and batch processing workflows.
-
-
- Set up multi-node PyTorch training for deep learning models. This tutorial covers distributed data parallel training, gradient synchronization, and performance optimization techniques.
+
+
+ Set up a managed Slurm cluster for high-performance computing workloads.
-
- Use Axolotl's framework for fine-tuning large language models across multiple GPUs. This approach simplifies customizing pre-trained models like Llama or Mistral with built-in training optimizations.
+
+ Set up multi-node PyTorch training for deep learning models.
-
- For advanced users who need full control over Slurm configuration. This option provides a basic Slurm installation that you can customize for specialized workloads.
+
+ Use Axolotl's framework for fine-tuning large language models across multiple GPUs.
-You can also follow this [video tutorial](https://www.youtube.com/watch?v=k_5rwWyxo5s?si=r3lZclHcoY3HJYyg) to learn how to deploy Kimi K2 using Instant Clusters.
-
-
-
## How it works
When you deploy an Instant Cluster, Runpod provisions multiple GPU nodes within the same and connects them with high-speed networking. One node is designated as the primary node, and all nodes receive pre-configured environment variables for distributed communication.
@@ -127,4 +116,4 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a
Runpod offers custom Instant Cluster pricing plans for large scale and enterprise workloads. If you're interested in learning more, [contact our sales team](https://ecykq.share.hsforms.com/2MZdZATC3Rb62Dgci7knjbA).
-
+
\ No newline at end of file
diff --git a/overview.mdx b/overview.mdx
index 18d8119b..2c11bc39 100644
--- a/overview.mdx
+++ b/overview.mdx
@@ -2,131 +2,134 @@
title: "Welcome to Runpod"
description: "Explore our guides and examples to deploy your AI/ML application on Runpod."
sidebarTitle: "Welcome"
+mode: "wide"
---
+
import { TrainingTooltip, FineTuningTooltip, InferenceTooltip } from "/snippets/tooltips.jsx";
-Runpod is a cloud computing platform built for AI, machine learning, and general compute needs. Whether you're or AI models, or deploying cloud-based applications for , Runpod provides scalable, high-performance GPU and CPU resources to power your workloads.
+
-## Get started
+Runpod is a cloud computing platform built for AI, machine learning, and general compute needs. Whether you're or AI models, or deploying cloud-based applications for , Runpod provides scalable, high-performance GPU and CPU resources to power your workloads.
-If you're new to Runpod, start here to learn the essentials and deploy your first GPU.
+## Access GPUs instantly
-
-
+
+
Create an account, deploy your first GPU Pod, and use it to execute code.
-
+
+ Create API keys to manage your access to Runpod resources.
+
+
Learn about the key concepts and terminology for the Runpod platform.
-
- Create API keys to manage your access to Runpod resources.
+
+ Run Python functions on remote GPUs directly from your local terminal.
-
- Connect your AI tools to Runpod's MCP servers to manage resources and access docs.
+
+ Pay-per-second computing with automatic scaling for production AI/ML apps.
+
+
+ Dedicated GPU or CPU instances for containerized AI/ML workloads.
-You can also watch this video for a high-level overview of our products:
+## Use our model endpoints
-
+Runpod offers [Public Endpoints](/public-endpoints/overview) for instant API access to pre-deployed AI models for image, video, audio, and text generation. No deployment or infrastructure required—just [create an API key](/get-started/api-keys) and make a request:
-## Serverless
+
-Serverless provides pay-per-second computing with automatic scaling for production AI/ML apps. You only pay for actual compute time when your code runs, with no idle costs, making Serverless ideal for variable workloads and cost-efficient production deployments.
+```python Python
+import requests
-
-
- Learn how Serverless works and how to deploy pre-configured endpoints.
-
-
- Learn how Serverless billing works and how to optimize your costs.
-
-
- Write a handler function, build a worker image, create an endpoint, and send your first request.
-
-
- Deploy a large language model for text or image generation in minutes using vLLM.
-
+response = requests.post(
+ "https://api.runpod.ai/v2/black-forest-labs-flux-1-schnell/runsync",
+ headers={
+ "Authorization": "Bearer YOUR_API_KEY", # Replace YOUR_API_KEY with your actual API key
+ "Content-Type": "application/json"
+ },
+ json={
+ "input": {
+ "prompt": "A beautiful sunset over mountains", # Customize your prompt
+ "width": 1024,
+ "height": 1024
+ }
+ }
+)
-
+result = response.json()
+print(result["output"]["image_url"])
+```
-## Pods
+```bash cURL
+# Replace YOUR_API_KEY with your actual API key
+curl -X POST "https://api.runpod.ai/v2/black-forest-labs-flux-1-schnell/runsync" \
+ -H "Authorization: Bearer YOUR_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{
+ "input": {
+ "prompt": "A beautiful sunset over mountains",
+ "width": 1024,
+ "height": 1024
+ }
+ }'
+```
-Pods give you dedicated GPU or CPU instances for containerized AI/ML workloads. Pods are billed by the minute and stay available as long as you keep them running, making them perfect for development, training, and workloads that need continuous access.
+
-
-
- Understand the components of a Pod and options for configuration.
-
-
- Learn about Pod pricing options and how to optimize your costs.
-
-
- Learn how to choose the right Pod for your workload.
-
-
- Learn how to deploy a Pod with ComfyUI pre-installed and start generating images.
-
-
-
-## Public Endpoints
+For a list of available models, see the [model reference](/public-endpoints/reference).
-Public Endpoints provide instant API access to pre-deployed AI models for image, video, audio, and text generation. No deployment or infrastructure required. You only pay for what you generate, making it easy to integrate AI capabilities into your applications.
+## Guides and examples
-
-
- Learn how Public Endpoints work and when to use them.
+
+
+ Deploy a dedicated GPU with ComfyUI pre-installed and start generating images.
-
- Generate your first image in under 5 minutes.
+
+ Build a ComfyUI worker and deploy it as a Serverless endpoint.
-
- Use the playground, REST API, and SDKs.
+
+ Use a hybrid local/remote script to generate images with SDXL.
-
- Browse available models and their parameters.
+
+ Create a multi-model pipeline for video generation.
+
+
+ Create a REST API with automatic load balancing using Flash.
+
+
+ Deploy a large language model in minutes using vLLM on Serverless.
-## Instant Clusters
+## High-performance clusters
-Instant Clusters deliver fully managed multi-node compute clusters with high-speed networking (up to 3200 Gbps) for distributed workloads. Run multi-node training, fine-tune large language models, or scale inference across multiple GPUs working in parallel.
+Create a multi-node [Instant Cluster](/instant-clusters) for fully managed distributed GPU computing with high-speed networking between nodes.
-
-
+
+
Learn how Instant Clusters work and when to use them.
-
- Environment variables, network interfaces, and NCCL configuration.
-
-
+
Set up managed Slurm for HPC workloads.
-
+
Run distributed PyTorch training across multiple nodes.
## Support
-
-
+
+
Submit a support request using our contact page.
-
- Email help@runpod.io for direct support.
-
-
+
Check the status of Runpod services and infrastructure.
-
+
Join the Runpod community on Discord.
+
diff --git a/public-endpoints/models/chatterbox-turbo.mdx b/public-endpoints/models/chatterbox-turbo.mdx
index 97cd809d..6073bc3f 100644
--- a/public-endpoints/models/chatterbox-turbo.mdx
+++ b/public-endpoints/models/chatterbox-turbo.mdx
@@ -6,7 +6,7 @@ description: "Fast open-source text-to-speech with expressive voice cloning and
Chatterbox Turbo is Resemble AI's fastest open-source text-to-speech model with paralinguistic tags for non-speech sounds and expressive voice cloning capabilities. It supports multiple preset voices and custom voice cloning via audio URL.
-
+
Test Chatterbox Turbo in the Runpod Hub playground.
diff --git a/public-endpoints/models/cogito-671b.mdx b/public-endpoints/models/cogito-671b.mdx
index 25cadf1d..2e81312c 100644
--- a/public-endpoints/models/cogito-671b.mdx
+++ b/public-endpoints/models/cogito-671b.mdx
@@ -6,7 +6,7 @@ description: "Deep Cogito's 671B parameter Mixture-of-Experts model with FP8 dyn
Cogito 671B v2.1 is Deep Cogito's massive 671B parameter Mixture-of-Experts (MoE) language model. It uses FP8 dynamic quantization for efficient inference while maintaining high-quality outputs across reasoning, coding, and general knowledge tasks.
-
+
Test Cogito 671B v2.1 in the Runpod Hub playground.
diff --git a/public-endpoints/models/flux-dev.mdx b/public-endpoints/models/flux-dev.mdx
index 0d92cbfb..d1abb451 100644
--- a/public-endpoints/models/flux-dev.mdx
+++ b/public-endpoints/models/flux-dev.mdx
@@ -6,7 +6,7 @@ description: "High-quality image generation with exceptional prompt adherence an
Flux Dev is Black Forest Labs' flagship image generation model, optimized for high visual fidelity and detailed outputs. It offers exceptional prompt adherence and produces images with rich detail, making it ideal for production use cases.
-
+
Test Flux Dev in the Runpod Hub playground.
diff --git a/public-endpoints/models/flux-kontext-dev.mdx b/public-endpoints/models/flux-kontext-dev.mdx
index 953ec610..00d38efd 100644
--- a/public-endpoints/models/flux-kontext-dev.mdx
+++ b/public-endpoints/models/flux-kontext-dev.mdx
@@ -6,7 +6,7 @@ description: "12 billion parameter model for editing images based on text instru
Flux Kontext Dev is a 12 billion parameter rectified flow transformer capable of editing images based on text instructions. It excels at making targeted edits to existing images while preserving the overall context and style.
-
+
Test Flux Kontext Dev in the Runpod Hub playground.
diff --git a/public-endpoints/models/flux-schnell.mdx b/public-endpoints/models/flux-schnell.mdx
index d0a3f506..f12b2700 100644
--- a/public-endpoints/models/flux-schnell.mdx
+++ b/public-endpoints/models/flux-schnell.mdx
@@ -6,7 +6,7 @@ description: "Fast, lightweight image generation optimized for speed and prototy
Flux Schnell is Black Forest Labs' fastest and most lightweight FLUX model, ideal for local development, prototyping, and personal use. It generates images quickly with lower step counts while maintaining good quality.
-
+
Test Flux Schnell in the Runpod Hub playground.
diff --git a/public-endpoints/models/gpt-oss-120b.mdx b/public-endpoints/models/gpt-oss-120b.mdx
index 0a12fc4e..9d471ef7 100644
--- a/public-endpoints/models/gpt-oss-120b.mdx
+++ b/public-endpoints/models/gpt-oss-120b.mdx
@@ -6,7 +6,7 @@ description: "OpenAI's open-weight 120B parameter language model for advanced te
GPT-OSS 120B is OpenAI's open-weight 120B parameter language model, offering powerful text generation capabilities with advanced reasoning and instruction-following abilities.
-
+
Test GPT-OSS 120B in the Runpod Hub playground.
diff --git a/public-endpoints/models/granite-4.mdx b/public-endpoints/models/granite-4.mdx
index 3c4da7dc..cf4b0a15 100644
--- a/public-endpoints/models/granite-4.mdx
+++ b/public-endpoints/models/granite-4.mdx
@@ -6,7 +6,7 @@ description: "A 32B parameter long-context instruct model for text generation."
IBM Granite-4.0-H-Small is a 32B parameter long-context instruct model. It excels at general text generation, instruction following, and conversational AI tasks with support for extended context lengths.
-
+
Test IBM Granite 4.0 in the Runpod Hub playground.
diff --git a/public-endpoints/models/infinitetalk.mdx b/public-endpoints/models/infinitetalk.mdx
index 6d0a8afa..c64c03d9 100644
--- a/public-endpoints/models/infinitetalk.mdx
+++ b/public-endpoints/models/infinitetalk.mdx
@@ -6,7 +6,7 @@ description: "Audio-driven video generation that creates talking or singing vide
InfiniteTalk is an audio-driven video generation model that creates talking or singing videos from a single image and audio input. It animates faces and bodies to match the audio, making it ideal for creating talking head videos, virtual presenters, or lip-synced content.
-
+
Test InfiniteTalk in the Runpod Hub playground.
diff --git a/public-endpoints/models/kling-v2-1.mdx b/public-endpoints/models/kling-v2-1.mdx
index a87768af..b0092287 100644
--- a/public-endpoints/models/kling-v2-1.mdx
+++ b/public-endpoints/models/kling-v2-1.mdx
@@ -6,7 +6,7 @@ description: "Professional-grade image-to-video with enhanced visual fidelity."
Kling v2.1 I2V Pro is a professional-grade image-to-video model with enhanced visual fidelity. It generates high-quality videos from static images with smooth motion and excellent detail preservation.
-
+
Test Kling v2.1 I2V Pro in the Runpod Hub playground.
diff --git a/public-endpoints/models/kling-v2-6-motion-control.mdx b/public-endpoints/models/kling-v2-6-motion-control.mdx
index 23b6fa27..44a0dd3b 100644
--- a/public-endpoints/models/kling-v2-6-motion-control.mdx
+++ b/public-endpoints/models/kling-v2-6-motion-control.mdx
@@ -6,7 +6,7 @@ description: "Transfer motion from reference videos to animate still images."
Kling v2.6 Standard Motion Control transfers motion from reference videos to animate still images. Upload a character image and a motion clip, and the model extracts the movement to generate smooth video output.
-
+
Test Kling v2.6 Motion Control in the Runpod Hub playground.
diff --git a/public-endpoints/models/kling-video-o1-r2v.mdx b/public-endpoints/models/kling-video-o1-r2v.mdx
index 00930d71..d021ff2d 100644
--- a/public-endpoints/models/kling-video-o1-r2v.mdx
+++ b/public-endpoints/models/kling-video-o1-r2v.mdx
@@ -6,7 +6,7 @@ description: "Creative video generation using character, prop, or scene referenc
Kling Video O1 R2V generates creative videos using character, prop, or scene references from multiple viewpoints. It can combine multiple reference images to create coherent video content.
-
+
Test Kling Video O1 R2V in the Runpod Hub playground.
diff --git a/public-endpoints/models/minimax-speech.mdx b/public-endpoints/models/minimax-speech.mdx
index 3dea2be6..55f83646 100644
--- a/public-endpoints/models/minimax-speech.mdx
+++ b/public-endpoints/models/minimax-speech.mdx
@@ -6,7 +6,7 @@ description: "High-definition text-to-speech with emotional control and voice cu
Minimax Speech 02 HD is a high-definition text-to-speech model with emotional control and voice customization. It produces natural-sounding speech with adjustable speed, pitch, volume, and emotional tone.
-
+
Test Minimax Speech 02 HD in the Runpod Hub playground.
diff --git a/public-endpoints/models/nano-banana-edit.mdx b/public-endpoints/models/nano-banana-edit.mdx
index 451f1e9e..66b2f849 100644
--- a/public-endpoints/models/nano-banana-edit.mdx
+++ b/public-endpoints/models/nano-banana-edit.mdx
@@ -6,7 +6,7 @@ description: "Google's state-of-the-art image editing model for combining multip
Nano Banana Edit is Google's state-of-the-art image editing model that excels at combining multiple source images into a cohesive output. It can take up to four reference images and merge them based on text instructions.
-
+
Test Nano Banana Edit in the Runpod Hub playground.
diff --git a/public-endpoints/models/nano-banana-pro-edit.mdx b/public-endpoints/models/nano-banana-pro-edit.mdx
index 64c16bf8..e5b71698 100644
--- a/public-endpoints/models/nano-banana-pro-edit.mdx
+++ b/public-endpoints/models/nano-banana-pro-edit.mdx
@@ -6,7 +6,7 @@ description: "Google's advanced image editing model with support for up to 14 re
Nano Banana Pro Edit is Google's advanced image editing model that excels at combining multiple source images into a cohesive output. It supports up to 14 reference images and offers multiple resolution options for flexible output quality.
-
+
Test Nano Banana Pro Edit in the Runpod Hub playground.
diff --git a/public-endpoints/models/p-image-edit.mdx b/public-endpoints/models/p-image-edit.mdx
index c6a51fb5..4ee6d358 100644
--- a/public-endpoints/models/p-image-edit.mdx
+++ b/public-endpoints/models/p-image-edit.mdx
@@ -6,7 +6,7 @@ description: "Premium image editing with complex compositions, style transfers,
P-Image Edit is Pruna's premium image editing model that supports complex compositions, style transfers, and targeted edits with text instructions. It can process up to 5 images in a single request.
-
+
Test P-Image Edit in the Runpod Hub playground.
diff --git a/public-endpoints/models/p-image-t2i.mdx b/public-endpoints/models/p-image-t2i.mdx
index 8c996bcc..be136b61 100644
--- a/public-endpoints/models/p-image-t2i.mdx
+++ b/public-endpoints/models/p-image-t2i.mdx
@@ -6,7 +6,7 @@ description: "Ultra-fast text-to-image with automatic prompt enhancement and 2-s
P-Image is Pruna's ultra-fast text-to-image model with automatic prompt enhancement and 2-stage refinement. It generates high-quality images quickly with minimal configuration.
-
+
Test P-Image T2I in the Runpod Hub playground.
diff --git a/public-endpoints/models/qwen-image-edit-2511-lora.mdx b/public-endpoints/models/qwen-image-edit-2511-lora.mdx
index cc9acb10..d8ec06d9 100644
--- a/public-endpoints/models/qwen-image-edit-2511-lora.mdx
+++ b/public-endpoints/models/qwen-image-edit-2511-lora.mdx
@@ -6,7 +6,7 @@ description: "Advanced image editing with complex text rendering and LoRA suppor
Qwen Image Edit 2511 LoRA achieves significant advances in complex text rendering and precise image editing with LoRA support. It enables style customization through LoRA models while maintaining editing precision.
-
+
Test Qwen Image Edit 2511 LoRA in the Runpod Hub playground.
diff --git a/public-endpoints/models/qwen-image-edit-2511.mdx b/public-endpoints/models/qwen-image-edit-2511.mdx
index 47e54130..5b2f6f24 100644
--- a/public-endpoints/models/qwen-image-edit-2511.mdx
+++ b/public-endpoints/models/qwen-image-edit-2511.mdx
@@ -6,7 +6,7 @@ description: "Advanced image editing with strong consistency and multi-person id
Qwen Image Edit 2511 delivers stronger edit consistency, robust multi-person identity and pose consistency, built-in LoRA styles, and enhanced industrial and product design capabilities.
-
+
Test Qwen Image Edit 2511 in the Runpod Hub playground.
diff --git a/public-endpoints/models/qwen-image-edit.mdx b/public-endpoints/models/qwen-image-edit.mdx
index 20371eda..1657e0c1 100644
--- a/public-endpoints/models/qwen-image-edit.mdx
+++ b/public-endpoints/models/qwen-image-edit.mdx
@@ -6,7 +6,7 @@ description: "Image editing with unique text rendering capabilities."
Qwen Image Edit extends Qwen's advanced text rendering capabilities to image editing tasks. It excels at making precise edits to existing images while preserving quality and adding or modifying text within images.
-
+
Test Qwen Image Edit in the Runpod Hub playground.
diff --git a/public-endpoints/models/qwen-image-lora.mdx b/public-endpoints/models/qwen-image-lora.mdx
index 9971c73a..5b05b1e3 100644
--- a/public-endpoints/models/qwen-image-lora.mdx
+++ b/public-endpoints/models/qwen-image-lora.mdx
@@ -6,7 +6,7 @@ description: "Image generation with LoRA support and advanced text rendering."
Qwen Image LoRA extends the base Qwen Image model with LoRA support, allowing you to customize generation with fine-tuned LoRA models. It retains the advanced text rendering capabilities of Qwen Image while enabling style customization.
-
+
Test Qwen Image LoRA in the Runpod Hub playground.
diff --git a/public-endpoints/models/qwen-image.mdx b/public-endpoints/models/qwen-image.mdx
index 14f8df7e..e587290e 100644
--- a/public-endpoints/models/qwen-image.mdx
+++ b/public-endpoints/models/qwen-image.mdx
@@ -6,7 +6,7 @@ description: "Image generation foundation model with advanced text rendering cap
Qwen Image is an image generation foundation model with advanced text rendering capabilities. It excels at generating images that include readable, well-formed text within the image.
-
+
Test Qwen Image in the Runpod Hub playground.
diff --git a/public-endpoints/models/qwen3-32b.mdx b/public-endpoints/models/qwen3-32b.mdx
index bfc10cee..4fdebb27 100644
--- a/public-endpoints/models/qwen3-32b.mdx
+++ b/public-endpoints/models/qwen3-32b.mdx
@@ -6,7 +6,7 @@ description: "Latest generation LLM with advanced reasoning, instruction-followi
Qwen3 32B AWQ is the latest large language model in the Qwen series, offering advancements in reasoning, instruction-following, agent capabilities, and multilingual support. It uses AWQ quantization for efficient inference while maintaining high quality.
-
+
Test Qwen3 32B AWQ in the Runpod Hub playground.
diff --git a/public-endpoints/models/seedance-1-5-pro.mdx b/public-endpoints/models/seedance-1-5-pro.mdx
index 915bd984..3d4f11fa 100644
--- a/public-endpoints/models/seedance-1-5-pro.mdx
+++ b/public-endpoints/models/seedance-1-5-pro.mdx
@@ -6,7 +6,7 @@ description: "Cinematic image-to-video with expressive motion and stable aesthet
Seedance 1.5 Pro I2V generates cinematic, live-action-leaning clips from a text prompt. It preserves the image's subject and composition while adding expressive motion and stable aesthetics.
-
+
Test Seedance 1.5 Pro I2V in the Runpod Hub playground.
diff --git a/public-endpoints/models/seedance-1-pro.mdx b/public-endpoints/models/seedance-1-pro.mdx
index 44f26c3f..757c9ebd 100644
--- a/public-endpoints/models/seedance-1-pro.mdx
+++ b/public-endpoints/models/seedance-1-pro.mdx
@@ -6,7 +6,7 @@ description: "High-performance video generation with multi-shot storytelling cap
Seedance 1.0 Pro is ByteDance's high-performance video generation model with multi-shot storytelling capabilities. It supports both text-to-video and image-to-video generation with customizable duration, frame rate, and resolution.
-
+
Test Seedance 1.0 Pro in the Runpod Hub playground.
diff --git a/public-endpoints/models/seedream-3.mdx b/public-endpoints/models/seedream-3.mdx
index 5baf7abf..2de57da8 100644
--- a/public-endpoints/models/seedream-3.mdx
+++ b/public-endpoints/models/seedream-3.mdx
@@ -6,7 +6,7 @@ description: "Native high-resolution bilingual image generation supporting Chine
Seedream 3.0 is ByteDance's native high-resolution bilingual image generation model supporting both Chinese and English prompts. It generates high-quality images with excellent prompt adherence in both languages.
-
+
Test Seedream 3.0 in the Runpod Hub playground.
diff --git a/public-endpoints/models/seedream-4-edit.mdx b/public-endpoints/models/seedream-4-edit.mdx
index f22cad4d..f4f319b4 100644
--- a/public-endpoints/models/seedream-4-edit.mdx
+++ b/public-endpoints/models/seedream-4-edit.mdx
@@ -6,7 +6,7 @@ description: "New-generation image editing with unified generation and editing a
Seedream 4.0 Edit provides advanced image editing capabilities using the same unified architecture as Seedream 4.0 T2I. It can edit or combine multiple source images based on text instructions.
-
+
Test Seedream 4.0 Edit in the Runpod Hub playground.
diff --git a/public-endpoints/models/seedream-4-t2i.mdx b/public-endpoints/models/seedream-4-t2i.mdx
index bc04a9e2..a9749625 100644
--- a/public-endpoints/models/seedream-4-t2i.mdx
+++ b/public-endpoints/models/seedream-4-t2i.mdx
@@ -6,7 +6,7 @@ description: "New-generation image creation with unified generation and editing
Seedream 4.0 T2I is ByteDance's new-generation image creation model that integrates both generation and editing capabilities within a unified architecture. It produces high-quality images with excellent prompt adherence.
-
+
Test Seedream 4.0 T2I in the Runpod Hub playground.
diff --git a/public-endpoints/models/sora-2-pro.mdx b/public-endpoints/models/sora-2-pro.mdx
index c5b66247..97d7ea8f 100644
--- a/public-endpoints/models/sora-2-pro.mdx
+++ b/public-endpoints/models/sora-2-pro.mdx
@@ -6,7 +6,7 @@ description: "OpenAI's Sora 2 Pro professional-grade video and audio generation
SORA 2 Pro I2V is OpenAI's professional-grade video and audio generation model. It produces higher quality output than the standard SORA 2, with enhanced visual fidelity and more nuanced audio generation.
-
+
Test SORA 2 Pro I2V in the Runpod Hub playground.
diff --git a/public-endpoints/models/sora-2.mdx b/public-endpoints/models/sora-2.mdx
index db7ab28d..9c827d05 100644
--- a/public-endpoints/models/sora-2.mdx
+++ b/public-endpoints/models/sora-2.mdx
@@ -6,7 +6,7 @@ description: "OpenAI's Sora 2 video and audio generation model."
SORA 2 I2V is OpenAI's video and audio generation model that creates dynamic videos from static images. It excels at generating videos with complex actions, ambient sounds, and character dialogue based on detailed text prompts.
-
+
Test SORA 2 I2V in the Runpod Hub playground.
diff --git a/public-endpoints/models/wan-2-1-i2v.mdx b/public-endpoints/models/wan-2-1-i2v.mdx
index c0b6fa3c..f420cfc9 100644
--- a/public-endpoints/models/wan-2-1-i2v.mdx
+++ b/public-endpoints/models/wan-2-1-i2v.mdx
@@ -6,7 +6,7 @@ description: "Open-source image-to-video generation that converts static images
WAN 2.1 I2V 720p is an open-source image-to-video generation model that converts static images into 720p videos. It uses a diffusion transformer architecture to create smooth, natural motion from still images.
-
+
Test WAN 2.1 I2V 720p in the Runpod Hub playground.
diff --git a/public-endpoints/models/wan-2-1-t2v.mdx b/public-endpoints/models/wan-2-1-t2v.mdx
index ea92cd41..f8fb1ec6 100644
--- a/public-endpoints/models/wan-2-1-t2v.mdx
+++ b/public-endpoints/models/wan-2-1-t2v.mdx
@@ -6,7 +6,7 @@ description: "Open-source text-to-video generation for creating 720p videos from
WAN 2.1 T2V 720p is an open-source video generation model for creating 720p videos directly from text prompts. It uses a diffusion transformer architecture to generate high-quality video content without requiring a source image.
-
+
Test WAN 2.1 T2V 720p in the Runpod Hub playground.
diff --git a/public-endpoints/models/wan-2-2-i2v-lora.mdx b/public-endpoints/models/wan-2-2-i2v-lora.mdx
index 21d8c828..9f8887e4 100644
--- a/public-endpoints/models/wan-2-2-i2v-lora.mdx
+++ b/public-endpoints/models/wan-2-2-i2v-lora.mdx
@@ -6,7 +6,7 @@ description: "Open-source video generation with LoRA support for customized came
WAN 2.2 I2V 720p LoRA is an open-source video generation model with LoRA support for customized camera movements and effects. It uses separate high-noise and low-noise LoRA configurations to achieve precise control over motion and style.
-
+
Test WAN 2.2 I2V 720p LoRA in the Runpod Hub playground.
diff --git a/public-endpoints/models/wan-2-2-i2v.mdx b/public-endpoints/models/wan-2-2-i2v.mdx
index 8ca71299..32de2093 100644
--- a/public-endpoints/models/wan-2-2-i2v.mdx
+++ b/public-endpoints/models/wan-2-2-i2v.mdx
@@ -6,7 +6,7 @@ description: "Open-source image-to-video generation using diffusion transformer
WAN 2.2 I2V 720p is an open-source AI video generation model that uses a diffusion transformer architecture for image-to-video generation. It creates smooth, high-quality 720p video content from static images.
-
+
Test WAN 2.2 I2V 720p in the Runpod Hub playground.
diff --git a/public-endpoints/models/wan-2-2-t2v.mdx b/public-endpoints/models/wan-2-2-t2v.mdx
index 40953bfe..ac8bf144 100644
--- a/public-endpoints/models/wan-2-2-t2v.mdx
+++ b/public-endpoints/models/wan-2-2-t2v.mdx
@@ -6,7 +6,7 @@ description: "Open-source text-to-video generation using diffusion transformer a
WAN 2.2 T2V 720p is an open-source AI video generation model that uses a diffusion transformer architecture for text-to-video generation. It creates 720p videos directly from text prompts without requiring a source image.
-
+
Test WAN 2.2 T2V 720p in the Runpod Hub playground.
diff --git a/public-endpoints/models/wan-2-5.mdx b/public-endpoints/models/wan-2-5.mdx
index db83336a..126fdf46 100644
--- a/public-endpoints/models/wan-2-5.mdx
+++ b/public-endpoints/models/wan-2-5.mdx
@@ -6,7 +6,7 @@ description: "Image-to-video generation model with prompt expansion support."
WAN 2.5 is Alibaba's image-to-video generation model that creates videos from static images. It features optional prompt expansion to automatically enhance your prompts for better results.
-
+
Test WAN 2.5 in the Runpod Hub playground.
diff --git a/public-endpoints/models/wan-2-6-t2i.mdx b/public-endpoints/models/wan-2-6-t2i.mdx
index 1e397ebe..d493936e 100644
--- a/public-endpoints/models/wan-2-6-t2i.mdx
+++ b/public-endpoints/models/wan-2-6-t2i.mdx
@@ -6,7 +6,7 @@ description: "High-quality text-to-image with strong prompt adherence and clean
WAN 2.6 Text-to-Image generates high-quality images from natural-language prompts with strong prompt adherence and clean composition. It produces detailed, coherent images across a variety of styles.
-
+
Test WAN 2.6 T2I in the Runpod Hub playground.
diff --git a/public-endpoints/models/wan-2-6-t2v.mdx b/public-endpoints/models/wan-2-6-t2v.mdx
index 866d46f9..08da08c6 100644
--- a/public-endpoints/models/wan-2-6-t2v.mdx
+++ b/public-endpoints/models/wan-2-6-t2v.mdx
@@ -6,7 +6,7 @@ description: "Text-to-video with cinematic quality, stable motion, and strong in
WAN 2.6 Text-to-Video turns plain prompts into coherent, cinematic clips with crisp detail, stable motion, and strong instruction-following. It supports multiple resolutions and durations up to 15 seconds.
-
+
Test WAN 2.6 T2V in the Runpod Hub playground.
diff --git a/public-endpoints/models/whisper-v3.mdx b/public-endpoints/models/whisper-v3.mdx
index 3f94c97a..7cfd354f 100644
--- a/public-endpoints/models/whisper-v3.mdx
+++ b/public-endpoints/models/whisper-v3.mdx
@@ -6,7 +6,7 @@ description: "State-of-the-art automatic speech recognition for transcribing aud
Whisper V3 Large is OpenAI's state-of-the-art automatic speech recognition model that transcribes audio to text. It supports multiple languages and can handle various audio formats with high accuracy.
-
+
Test Whisper V3 Large in the Runpod Hub playground.
diff --git a/public-endpoints/models/z-image-turbo.mdx b/public-endpoints/models/z-image-turbo.mdx
index 361e14e7..c7b8b62f 100644
--- a/public-endpoints/models/z-image-turbo.mdx
+++ b/public-endpoints/models/z-image-turbo.mdx
@@ -6,7 +6,7 @@ description: "Fast 6B parameter image generation model with text-to-image and im
Z-Image Turbo is a powerful and highly efficient 6B parameter image generation model that supports both text-to-image and image-to-image generation. It delivers high-quality results with fast inference times.
-
+
Test Z-Image Turbo in the Runpod Hub playground.
diff --git a/public-endpoints/overview.mdx b/public-endpoints/overview.mdx
index 314380dd..0f0e97f1 100644
--- a/public-endpoints/overview.mdx
+++ b/public-endpoints/overview.mdx
@@ -2,8 +2,11 @@
title: "Overview"
sidebarTitle: "Overview"
description: "Test and deploy production-ready AI models using Public Endpoints."
+mode: "wide"
---
+
+
@@ -33,20 +36,20 @@ Consider [Runpod Serverless](/serverless/overview) instead if you need custom mo
## Get started
-
-
+
+
Generate your first image in under 5 minutes.
-
+
Use the playground and REST API.
-
+
Integrate with JavaScript and TypeScript projects.
-
+
Browse available models and their parameters.
-
+
Chain multiple endpoints to generate videos from text.
@@ -128,4 +131,4 @@ For complete pricing information, see the [model reference](/public-endpoints/re
- [Quickstart](/public-endpoints/quickstart): Generate your first image in under 5 minutes.
- [Make API requests](/public-endpoints/requests): Use the playground and REST API.
- [Vercel AI SDK](/public-endpoints/ai-sdk): Integrate with JavaScript and TypeScript projects.
-- [Model reference](/public-endpoints/reference): View all available models and their parameters.
+- [Model reference](/public-endpoints/reference): View all available models and their parameters.
\ No newline at end of file
diff --git a/public-endpoints/quickstart.mdx b/public-endpoints/quickstart.mdx
index cf991588..caaa606d 100644
--- a/public-endpoints/quickstart.mdx
+++ b/public-endpoints/quickstart.mdx
@@ -2,8 +2,11 @@
title: "Quickstart"
sidebarTitle: "Quickstart"
description: "Generate your first image with Public Endpoints in under 5 minutes."
+mode: "wide"
---
+
+
This quickstart walks you through generating an image using Runpod Public Endpoints. You'll use the [Flux Schnell](/public-endpoints/models/flux-schnell) model, which is optimized for fast generation.
## Requirements
@@ -112,4 +115,4 @@ Image URLs expire after 7 days. Download images immediately if you need to keep
- [Make API requests](/public-endpoints/requests): Learn about async requests, SDKs, and best practices.
- [Model reference](/public-endpoints/reference): Explore all available models and their parameters.
- [Connect AI coding tools](/public-endpoints/ai-coding-tools): Use Public Endpoints with Cursor, Cline, and OpenCode.
-- [Build a text-to-video pipeline](/tutorials/public-endpoints/text-to-video-pipeline): Chain multiple endpoints to generate videos from text prompts.
+- [Build a text-to-video pipeline](/tutorials/public-endpoints/text-to-video-pipeline): Chain multiple endpoints to generate videos from text prompts.
\ No newline at end of file
diff --git a/public-endpoints/reference.mdx b/public-endpoints/reference.mdx
index a23cdf61..e7811df6 100644
--- a/public-endpoints/reference.mdx
+++ b/public-endpoints/reference.mdx
@@ -1,9 +1,12 @@
---
title: "Available models"
-sidebarTitle: "Overview"
+sidebarTitle: "Models"
description: "Browse all available models for Runpod Public Endpoints."
+mode: "wide"
---
+
+
This page lists all available models for Runpod Public Endpoints. Select a model below to view its parameters, pricing, and usage examples. You can also browse and test models in the [Runpod Hub playground](https://console.runpod.io/hub?tabSelected=public_endpoints).
## Image models
@@ -78,4 +81,4 @@ Transcribe speech or generate audio from text.
- [Quickstart](/public-endpoints/quickstart): Get started with your first API request.
- [Make API requests](/public-endpoints/requests): Learn about request/response formats.
- [Vercel AI SDK](/public-endpoints/ai-sdk): Use the TypeScript SDK for easier integration.
-- [Build a text-to-video pipeline](/tutorials/public-endpoints/text-to-video-pipeline): Chain multiple endpoints in a Python application.
+- [Build a text-to-video pipeline](/tutorials/public-endpoints/text-to-video-pipeline): Chain multiple endpoints in a Python application.
\ No newline at end of file
diff --git a/references/gpu-types.mdx b/references/gpu-types.mdx
index eb62b735..e13f1d86 100644
--- a/references/gpu-types.mdx
+++ b/references/gpu-types.mdx
@@ -1,7 +1,9 @@
---
title: "GPU types"
description: "Explore the GPUs available on Runpod."
+mode: "wide"
---
+
For information on pricing, see [GPU pricing](https://www.runpod.io/gpu-instance/pricing).
@@ -71,4 +73,4 @@ Use GPU pools when defining requirements for repositories published to the [Runp
| `ADA_48_PRO` | L40, L40S, 6000 Ada | 48 GB |
| `AMPERE_80` | A100 | 80 GB |
| `ADA_80_PRO` | H100 | 80 GB |
-| `HOPPER_141` | H200 | 141 GB |
\ No newline at end of file
+| `HOPPER_141` | H200 | 141 GB |
diff --git a/release-notes.mdx b/release-notes.mdx
index 6f880106..3cfcba8f 100644
--- a/release-notes.mdx
+++ b/release-notes.mdx
@@ -4,6 +4,56 @@ sidebarTitle: "Product updates"
description: "New features, fixes, and improvements for the Runpod platform."
---
+
+## Flash beta: Run Python functions on cloud GPUs
+
+[Flash](/flash/overview) is now in public beta. Flash is a Python SDK that lets you run functions on Runpod Serverless GPUs with a single decorator:
+
+```python
+import asyncio
+
+from runpod_flash import Endpoint, GpuType
+
+@Endpoint(
+ name="hello-gpu",
+ gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
+ dependencies=["torch"]
+)
+async def hello(): # This function runs on Runpod
+ import torch
+ gpu_name = torch.cuda.get_device_name(0)
+ print(f"Hello from your GPU! ({gpu_name})")
+ return {"gpu": gpu_name}
+
+asyncio.run(hello())
+print("Done!") # This runs locally
+```
+
+**Key features:**
+
+- **Remote execution**: Mark functions with `@Endpoint` to run on GPUs/CPUs automatically.
+- **Auto-scaling**: Workers scale from 0 to N based on demand.
+- **Dependency management**: Packages install automatically on remote workers.
+- **Two patterns**: Queue-based endpoints for batch work, load-balanced endpoints for REST APIs (see the sketch below).
+- **Flash apps**: Build production-ready APIs with `flash init`, `flash run`, and `flash deploy`.
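+
+Both patterns use the same `Endpoint` class. Here's a condensed, illustrative sketch based on the examples in the Flash tutorials (endpoint names and handler bodies are placeholders):
+
+```python
+from runpod_flash import Endpoint, GpuType
+
+# Queue-based pattern: decorate a function, then await it like a normal call.
+# Each call is queued and processed by a Serverless worker.
+@Endpoint(name="batch-job", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
+async def process_item(item: dict) -> dict:
+    return {"processed": item}
+
+# Load-balanced pattern: attach HTTP routes to an endpoint to build a REST API.
+api = Endpoint(name="rest-api", cpu="cpu5c-4-8")
+
+@api.post("/process")
+async def process_route(text: str) -> dict:
+    return {"received": text}
+```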
+
+**Get started:**
+
+
+
+ Learn more about Flash.
+
+
+ Run your first GPU workload in 5 minutes.
+
+
+ Learn queue-based and load-balanced patterns.
+
+
+ Development and deployment commands.
+
+
+
+
+
## New Public Endpoints and expanded examples
diff --git a/serverless/overview.mdx b/serverless/overview.mdx
index 339d43e5..52cb291d 100644
--- a/serverless/overview.mdx
+++ b/serverless/overview.mdx
@@ -20,10 +20,10 @@ Runpod Serverless is a cloud computing platform that lets you serve AI models fo
To get started with Serverless, follow one of the following guides to deploy your first endpoint:
-
+
Write a handler function, build a worker image, create an endpoint, and send your first request.
-
+
Deploy a Stable Diffusion endpoint to generate images at scale.
@@ -257,12 +257,26 @@ Runpod maintains a collection of s that you can use to
## Next steps
-Ready to get started with Runpod Serverless?
-
-* [Build your first worker.](/serverless/workers/custom-worker)
-* [Learn more about endpoints.](/serverless/endpoints/overview)
-* [Learn more about workers.](/serverless/workers/overview)
-* [Learn how to build handler functions.](/serverless/workers/handler-functions)
-* [Deploy large language models in minutes with vLLM.](/serverless/vllm/overview)
-* [Review storage options for your endpoints.](/serverless/storage/overview)
-* [Learn how to send requests to your endpoints.](/serverless/endpoints/send-requests)
+
+
+ Create and deploy a custom Serverless worker.
+
+
+ Learn how to configure and manage endpoints.
+
+
+ Understand how workers process requests.
+
+
+ Write handler functions to process incoming requests.
+
+
+ Deploy large language models in minutes.
+
+
+ Review storage options for your endpoints.
+
+
+ Learn how to structure and send requests to endpoints.
+
+
diff --git a/serverless/quickstart.mdx b/serverless/quickstart.mdx
index c102f504..76d4583f 100644
--- a/serverless/quickstart.mdx
+++ b/serverless/quickstart.mdx
@@ -240,9 +240,17 @@ Congratulations! You've successfully deployed and tested your first Serverless e
## Next steps
-Now that you've learned the basics, you're ready to:
-
-* [Create more advanced handler functions.](/serverless/workers/handler-functions)
-* [Update your Dockerfile with AI/ML models and other dependencies.](/serverless/workers/create-dockerfile)
-* [Learn how to structure and send requests to your endpoint.](/serverless/endpoints/send-requests)
-* [Manage your Serverless endpoints in the Runpod console.](/serverless/endpoints/overview)
\ No newline at end of file
+
+
+ Create more advanced handler functions.
+
+
+ Add AI/ML models and other dependencies to your worker.
+
+
+ Learn how to structure and send requests to your endpoint.
+
+
+ Configure and manage your Serverless endpoints.
+
+
\ No newline at end of file
diff --git a/snippets/tooltips.jsx b/snippets/tooltips.jsx
index c700bd92..553d498e 100644
--- a/snippets/tooltips.jsx
+++ b/snippets/tooltips.jsx
@@ -83,7 +83,7 @@ export const WorkerTooltip = () => {
export const WorkersTooltip = () => {
return (
- worker
+ workers
);
};
@@ -93,6 +93,12 @@ export const EndpointTooltip = () => {
);
};
+export const EndpointsTooltip = () => {
+ return (
+ endpoints
+ );
+};
+
export const QueueBasedEndpointTooltip = () => {
return (
queue-based endpoint
@@ -288,3 +294,17 @@ export const ServingTooltip = () => {
serving
);
};
+
+// FLASH
+
+export const ResourceConfigurationTooltip = () => {
+ return (
+ resource configuration
+ );
+};
+
+export const ResourceConfigurationsTooltip = () => {
+ return (
+ resource configurations
+ );
+};
diff --git a/style.css b/style.css
new file mode 100644
index 00000000..79147e98
--- /dev/null
+++ b/style.css
@@ -0,0 +1,95 @@
+/* Custom styles for Runpod documentation */
+
+/* Add your custom CSS rules here */
+
+/* Page-specific styles for overview page only */
+/* This uses a descendant selector to only target elements when .overview-page-wrapper exists */
+body:has(.overview-page-wrapper) #content-container {
+ max-width: 1080px !important;
+ /* Use min() to ensure width never exceeds available space */
+ width: min(1080px, calc(100vw - 304px - 128px)) !important;
+ /* Center with equal margins, but use max() to prevent negative margins */
+ margin-left: calc(304px + max(0px, (100vw - 304px - min(1080px, calc(100vw - 304px - 128px)) - 128px) / 2)) !important;
+ margin-right: auto !important;
+ margin-top: 48px !important;
+ padding-left: 64px !important;
+ padding-right: 64px !important;
+}
+
+body:has(.overview-page-wrapper) #content-area {
+ max-width: none !important;
+ width: 100% !important;
+}
+
+/* Responsive adjustments for overview page */
+@media (max-width: 1400px) {
+ body:has(.overview-page-wrapper) #content-container {
+ max-width: 900px !important;
+ width: min(900px, calc(100vw - 304px - 128px)) !important;
+ margin-left: calc(304px + max(0px, (100vw - 304px - min(900px, calc(100vw - 304px - 128px)) - 128px) / 2)) !important;
+ }
+}
+
+@media (max-width: 1100px) {
+ body:has(.overview-page-wrapper) #content-container {
+ max-width: 800px !important;
+ width: min(800px, calc(100vw - 304px - 128px)) !important;
+ margin-left: calc(304px + max(0px, (100vw - 304px - min(800px, calc(100vw - 304px - 128px)) - 128px) / 2)) !important;
+ }
+}
+
+/* Below 1024px: sidebar disappears, center normally with respect to full viewport */
+@media (max-width: 1024px) {
+ body:has(.overview-page-wrapper) #content-container {
+ max-width: 800px !important;
+ width: calc(100% - 64px) !important;
+ margin-left: auto !important;
+ margin-right: auto !important;
+ padding-left: 32px !important;
+ padding-right: 32px !important;
+ }
+}
+
+/* Extra small screens - stack everything */
+@media (max-width: 600px) {
+ body:has(.overview-page-wrapper) #content-container {
+ width: 100vw !important;
+ margin-left: 0 !important;
+ padding-left: 16px !important;
+ padding-right: 16px !important;
+ }
+}
+
+/* Responsive card stacking for overview page */
+/* Try multiple selectors to target card groups */
+@media (max-width: 1200px) {
+ body:has(.overview-page-wrapper) [data-component="card-group"],
+ body:has(.overview-page-wrapper) div[style*="grid-template-columns"],
+ body:has(.overview-page-wrapper) .card-group {
+ grid-template-columns: repeat(2, 1fr) !important;
+ }
+}
+
+/* 1 column on small screens */
+@media (max-width: 768px) {
+ body:has(.overview-page-wrapper) [data-component="card-group"],
+ body:has(.overview-page-wrapper) div[style*="grid-template-columns"],
+ body:has(.overview-page-wrapper) .card-group {
+ grid-template-columns: repeat(1, 1fr) !important;
+ }
+}
+
+/* Global card icon styles - align to top-left */
+a.card div.h-6.w-6 {
+ align-self: flex-start !important;
+ margin-top: 0.25rem !important;
+}
+
+/* Fix paragraph spacing inside overview wrapper */
+.overview-page-wrapper p {
+ margin-bottom: 1rem;
+}
+
+.overview-page-wrapper p + p {
+ margin-top: 1rem;
+}
diff --git a/tutorials/flash/build-rest-api-with-load-balancer.mdx b/tutorials/flash/build-rest-api-with-load-balancer.mdx
new file mode 100644
index 00000000..d039ff20
--- /dev/null
+++ b/tutorials/flash/build-rest-api-with-load-balancer.mdx
@@ -0,0 +1,700 @@
+---
+title: "Build a REST API with Flash"
+sidebarTitle: "Build a REST API"
+description: "Learn how to build a production-ready REST API using Flash load-balanced endpoints with custom HTTP routes."
+tag: "BETA"
+---
+
+This tutorial shows you how to build a REST API using Flash load-balanced endpoints. You'll create a multi-route API that handles text processing, demonstrates both CPU and GPU endpoints, and deploys to production.
+
+## What you'll learn
+
+In this tutorial you'll learn how to:
+
+- Create load-balanced endpoints with the `Endpoint` class
+- Define multiple routes on a single endpoint with `.get()`, `.post()`, and other HTTP method decorators
+- Add GPU-accelerated routes for ML inference
+- Test your API locally with `flash run`
+- Deploy your API to production with `flash deploy`
+- Call your deployed endpoints with proper authentication
+
+## Requirements
+
+- You've [created a Runpod account](/get-started/manage-accounts)
+- You've [created a Runpod API key](/get-started/api-keys)
+- You've installed [Python 3.10 or higher](https://www.python.org/downloads/)
+- You've completed the [Flash quickstart](/flash/quickstart) or are familiar with Flash basics
+
+## What you'll build
+
+By the end of this tutorial, you'll have a working REST API that:
+
+- Accepts text input via `POST /analyze`
+- Returns system health via `GET /health`
+- Provides API information via `GET /info`
+- Runs GPU-accelerated sentiment analysis via `POST /sentiment` (optional GPU route)
+- Deploys to Runpod Serverless with proper authentication
+
+## Step 1: Set up your project
+
+Create a new directory for your project and set up a Python virtual environment:
+
+```bash
+mkdir flash-api
+cd flash-api
+```
+
+Install Flash using [uv](https://docs.astral.sh/uv/):
+
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install runpod-flash
+```
+
+Set your API key in the environment:
+
+```bash
+export RUNPOD_API_KEY=YOUR_API_KEY
+
+# Or create a .env file
+echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+Replace `YOUR_API_KEY` with your actual Runpod API key.
+
+## Step 2: Create the API server file
+
+Create a new file called `api.py`:
+
+```bash
+touch api.py
+```
+
+## Step 3: Define the load-balanced endpoint
+
+Add the following code to `api.py`:
+
+```python
+from runpod_flash import Endpoint
+
+# CPU load-balanced endpoint for general API routes
+api = Endpoint(
+ name="text-api",
+ cpu="cpu5c-4-8", # 4 vCPU, 8GB RAM
+ workers=(0, 3), # Scale from 0 to 3 workers
+ idle_timeout=600 # Keep workers active for 10 minutes
+)
+```
+
+This configuration creates a CPU load-balanced endpoint that can handle multiple HTTP routes.
+
+
+**Worker quota considerations**: The `workers` setting determines the maximum number of concurrent workers. Standard Runpod accounts have a total quota of 30 workers across all endpoints. If you have other endpoints running, you may need to reduce `workers` to `(0, 1)`. Check your quota in the [Runpod console](https://www.runpod.io/console/serverless).
+
+
+## Step 4: Add API routes
+
+Add three routes to your API for health checks, API information, and text analysis:
+
+```python
+@api.get("/health")
+async def health_check() -> dict:
+ """Health check endpoint for monitoring."""
+ return {
+ "status": "healthy",
+ "service": "text-api",
+ "version": "1.0.0"
+ }
+
+@api.get("/info")
+async def get_info() -> dict:
+ """API information endpoint."""
+ return {
+ "name": "Text Analysis API",
+ "version": "1.0.0",
+ "endpoints": [
+ {"method": "GET", "path": "/health", "description": "Health check"},
+ {"method": "GET", "path": "/info", "description": "API information"},
+ {"method": "POST", "path": "/analyze", "description": "Analyze text"}
+ ]
+ }
+
+@api.post("/analyze")
+async def analyze_text(text: str) -> dict:
+ """Analyze text and return statistics."""
+ words = text.split()
+ word_count = len(words)
+ char_count = len(text)
+ avg_word_length = sum(len(word) for word in words) / word_count if word_count > 0 else 0
+
+ return {
+ "text": text,
+ "statistics": {
+ "word_count": word_count,
+ "character_count": char_count,
+ "average_word_length": round(avg_word_length, 2),
+ "sentence_count": text.count('.') + text.count('!') + text.count('?')
+ }
+ }
+```
+
+All three routes share the same `api` endpoint, meaning they deploy to a single Serverless endpoint.
+
+## Step 5: Add a GPU-accelerated route (optional)
+
+For GPU-accelerated sentiment analysis, add a separate endpoint:
+
+```python
+from runpod_flash import Endpoint, GpuGroup
+
+# GPU endpoint for ML inference
+gpu_api = Endpoint(
+ name="gpu-sentiment",
+ gpu=GpuGroup.ANY, # Use any available GPU for better availability
+ workers=(0, 1), # Scale from 0 to 1 worker
+ idle_timeout=300, # 5 minutes
+ dependencies=["transformers", "torch"]
+)
+
+@gpu_api.post("/sentiment")
+async def analyze_sentiment(text: str) -> dict:
+ """Analyze sentiment using a pretrained model."""
+ from transformers import pipeline
+ import torch
+
+ # Load sentiment analysis pipeline
+ device = 0 if torch.cuda.is_available() else -1
+ sentiment_analyzer = pipeline(
+ "sentiment-analysis",
+ model="distilbert-base-uncased-finetuned-sst-2-english",
+ device=device
+ )
+
+ # Analyze sentiment
+ result = sentiment_analyzer(text)[0]
+
+ return {
+ "text": text,
+ "sentiment": {
+ "label": result["label"],
+ "score": round(result["score"], 4)
+ },
+ "device": "GPU" if torch.cuda.is_available() else "CPU"
+ }
+```
+
+This creates a second endpoint specifically for GPU-accelerated tasks.
+
+
+The sentiment analysis route uses a separate GPU endpoint because it requires different hardware than the CPU routes. This is a common pattern: use CPU endpoints for lightweight API logic and GPU endpoints for ML inference.
+
+**GPU availability**: Using `GpuGroup.ANY` provides better availability than specific GPU types like `GpuGroup.ADA_24`. First requests to GPU endpoints may take 3-10 minutes due to:
+- GPU provisioning (depends on current availability)
+- Dependency installation (transformers, torch)
+- Model downloads (DistilBERT is ~250MB)
+
+During periods of high demand, GPU provisioning may take longer. Check [GPU availability](https://www.runpod.io/console/serverless) in the console.
+
+
+## Step 6: Add the main execution block
+
+Add the following at the end of `api.py` to enable local testing:
+
+```python
+import asyncio
+
+async def main():
+ """Test the API locally."""
+ print("Testing Text Analysis API\n")
+
+ # Test health check
+ print("1. Testing health check...")
+ health = await health_check()
+ print(f" Result: {health}\n")
+
+ # Test info endpoint
+ print("2. Testing info endpoint...")
+ info = await get_info()
+ print(f" Result: {info}\n")
+
+ # Test text analysis
+ print("3. Testing text analysis...")
+ sample_text = "Flash makes it easy to build REST APIs with GPU acceleration."
+ analysis = await analyze_text(sample_text)
+ print(f" Result: {analysis}\n")
+
+ # Test sentiment analysis (if GPU route is defined)
+ print("4. Testing sentiment analysis...")
+ try:
+ sentiment = await analyze_sentiment(sample_text)
+ print(f" Result: {sentiment}\n")
+ except ModuleNotFoundError as e:
+ print(f" Skipped (dependencies not installed locally): {e}")
+ print(f" Note: This will work when deployed to Flash with dependencies=['transformers', 'torch']\n")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+## Step 7: Test locally
+
+Run your script to test the API locally:
+
+```bash
+python api.py
+```
+
+You should see output similar to:
+
+```text
+Testing Text Analysis API
+
+1. Testing health check...
+ Result: {'status': 'healthy', 'service': 'text-api', 'version': '1.0.0'}
+
+2. Testing info endpoint...
+ Result: {'name': 'Text Analysis API', 'version': '1.0.0', 'endpoints': [...]}
+
+3. Testing text analysis...
+ Result: {'text': '...', 'statistics': {'word_count': 11, ...}}
+
+4. Testing sentiment analysis...
+ Skipped (dependencies not installed locally): No module named 'transformers'
+ Note: This will work when deployed to Flash with dependencies=['transformers', 'torch']
+```
+
+The first three endpoints will run locally. The sentiment endpoint will be skipped unless you install transformers and torch locally, but it will work when deployed to Flash.
+
+
+**Local testing limitations**: The GPU sentiment endpoint requires `transformers` and `torch` to be installed locally for testing. For full testing of all endpoints including GPU routes, use `flash run` (covered in Step 9) instead of direct Python execution.
+
+
+## Step 8: Build a Flash app for production
+
+To deploy your API to production, create a Flash app:
+
+```bash
+flash init api-project
+cd api-project
+```
+
+This creates a project structure with separate worker files. Now, split your API code into the appropriate worker files:
+
+### Create `lb_worker.py` (CPU routes)
+
+Replace the contents of `lb_worker.py` with:
+
+```python
+from runpod_flash import Endpoint
+
+# CPU load-balanced endpoint for general API routes
+api = Endpoint(
+ name="text-api",
+ cpu="cpu5c-4-8", # 4 vCPU, 8GB RAM
+ workers=(0, 3), # Scale from 0 to 3 workers
+ idle_timeout=600 # Keep workers active for 10 minutes
+)
+
+@api.get("/health")
+async def health_check() -> dict:
+ """Health check endpoint for monitoring."""
+ return {
+ "status": "healthy",
+ "service": "text-api",
+ "version": "1.0.0"
+ }
+
+@api.get("/info")
+async def get_info() -> dict:
+ """API information endpoint."""
+ return {
+ "name": "Text Analysis API",
+ "version": "1.0.0",
+ "endpoints": [
+ {"method": "GET", "path": "/health", "description": "Health check"},
+ {"method": "GET", "path": "/info", "description": "API information"},
+ {"method": "POST", "path": "/analyze", "description": "Analyze text"}
+ ]
+ }
+
+@api.post("/analyze")
+async def analyze_text(text: str) -> dict:
+ """Analyze text and return statistics."""
+ words = text.split()
+ word_count = len(words)
+ char_count = len(text)
+ avg_word_length = sum(len(word) for word in words) / word_count if word_count > 0 else 0
+
+ return {
+ "text": text,
+ "statistics": {
+ "word_count": word_count,
+ "character_count": char_count,
+ "average_word_length": round(avg_word_length, 2),
+ "sentence_count": text.count('.') + text.count('!') + text.count('?')
+ }
+ }
+```
+
+### Create `gpu_worker.py` (GPU route)
+
+If you added the GPU sentiment route, replace the contents of `gpu_worker.py` with:
+
+```python
+from runpod_flash import Endpoint, GpuGroup
+
+# GPU endpoint for ML inference
+gpu_api = Endpoint(
+ name="gpu-sentiment",
+ gpu=GpuGroup.ANY, # Use any available GPU for better availability
+ workers=(0, 1), # Scale from 0 to 1 worker
+ idle_timeout=300, # 5 minutes
+ dependencies=["transformers", "torch"]
+)
+
+@gpu_api.post("/sentiment")
+async def analyze_sentiment(text: str) -> dict:
+ """Analyze sentiment using a pretrained model."""
+ from transformers import pipeline
+ import torch
+
+ # Load sentiment analysis pipeline
+ device = 0 if torch.cuda.is_available() else -1
+ sentiment_analyzer = pipeline(
+ "sentiment-analysis",
+ model="distilbert-base-uncased-finetuned-sst-2-english",
+ device=device
+ )
+
+ # Analyze sentiment
+ result = sentiment_analyzer(text)[0]
+
+ return {
+ "text": text,
+ "sentiment": {
+ "label": result["label"],
+ "score": round(result["score"], 4)
+ },
+ "device": "GPU" if torch.cuda.is_available() else "CPU"
+ }
+```
+
+### Configure the environment
+
+```bash
+cp .env.example .env
+echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+Replace `YOUR_API_KEY` with your actual Runpod API key.
+
+## Step 9: Test with the development server
+
+Start the Flash development server:
+
+```bash
+flash run
+```
+
+You'll see output showing all available endpoints:
+
+```text
+Flash Dev Server localhost:8888
+
+┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┓
+┃ Local path ┃ Description ┃ Type ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━┩
+│ GET /lb_worker/health │ Health check endpoint for monitoring. │ LB │
+│ GET /lb_worker/info │ API information endpoint. │ LB │
+│ POST /lb_worker/analyze │ Analyze text and return statistics. │ LB │
+│ POST /gpu_worker/sentiment │ Analyze sentiment using a pretrained │ LB │
+│ │ model. │ │
+└─────────────────────────────┴─────────────────────────────────────────┴──────┘
+```
+
+
+**Development server path prefixes**: The `flash run` dev server adds worker file prefixes to routes (e.g., `/lb_worker/health`, `/gpu_worker/sentiment`). When deployed to production, endpoints use the paths as defined in the route decorators (e.g., `/health`, `/sentiment`) without the prefixes.
+
+
+Open http://localhost:8888/docs in your browser to see the interactive API documentation. You can test all your routes directly in the Swagger UI.
+
+Test with curl:
+
+```bash
+# Test health check
+curl -X GET http://localhost:8888/lb_worker/health
+
+# Test text analysis
+curl -X POST http://localhost:8888/lb_worker/analyze \
+ -H "Content-Type: application/json" \
+ -d '{"text": "Flash makes building APIs easy"}'
+
+# Test sentiment analysis (if you added the GPU route)
+# Note: The first request may take several minutes for GPU provisioning and model download
+curl -X POST http://localhost:8888/gpu_worker/sentiment \
+ -H "Content-Type: application/json" \
+ -d '{"text": "I love using Flash for my APIs"}'
+```
+
+Expected responses:
+
+```json
+// Health check
+{
+ "status": "healthy",
+ "service": "text-api",
+ "version": "1.0.0"
+}
+
+// Text analysis
+{
+ "text": "Flash makes building APIs easy",
+ "statistics": {
+ "word_count": 5,
+ "character_count": 30,
+ "average_word_length": 5.2,
+ "sentence_count": 0
+ }
+}
+
+// Sentiment analysis
+{
+ "text": "I love using Flash for my APIs",
+ "sentiment": {
+ "label": "POSITIVE",
+ "score": 0.9998
+ },
+ "device": "GPU"
+}
+```
+
+
+**GPU cold starts**: The first request to a GPU endpoint may take 3-10 minutes due to GPU provisioning, dependency installation, and model downloads. During periods of high demand, provisioning may take longer. Subsequent requests will be much faster. The default timeout is 60 seconds, which may be too short for the first request. If you encounter timeout errors, wait and retry; the GPU may still be initializing.
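+
+If a cold-start request times out, a client-side retry usually succeeds once the worker finishes initializing. Below is a minimal retry sketch using only the Python standard library; the URL and payload match the dev server routes above, and the attempt count and wait time are arbitrary choices:
+
+```python
+import json
+import time
+import urllib.error
+import urllib.request
+
+payload = json.dumps({"text": "I love using Flash for my APIs"}).encode()
+request = urllib.request.Request(
+    "http://localhost:8888/gpu_worker/sentiment",
+    data=payload,
+    headers={"Content-Type": "application/json"},
+)
+
+# Retry while the GPU worker provisions and downloads the model.
+for attempt in range(1, 11):
+    try:
+        with urllib.request.urlopen(request, timeout=120) as response:
+            print(json.loads(response.read()))
+            break
+    except (urllib.error.URLError, TimeoutError) as error:
+        print(f"Attempt {attempt} failed ({error}); retrying in 30 seconds...")
+        time.sleep(30)
+```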
+
+
+## Step 10: Deploy to production
+
+When you're ready to deploy, use `flash deploy`:
+
+```bash
+flash deploy
+```
+
+After deployment, Flash displays your endpoint URLs:
+
+```text
+✓ Deployment Complete
+
+Load-balanced endpoints:
+ https://api-abc123.runpod.net (text-api)
+ GET /health
+ GET /info
+ POST /analyze
+
+ https://api-def456.runpod.net (gpu-sentiment)
+ POST /sentiment
+```
+
+## Step 11: Call your deployed API
+
+Call your production endpoints with authentication:
+
+```bash
+# Health check
+curl -X GET https://api-abc123.runpod.net/health \
+ -H "Authorization: Bearer $RUNPOD_API_KEY"
+
+# Text analysis
+curl -X POST https://api-abc123.runpod.net/analyze \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"text": "Flash makes building APIs easy and fast"}'
+
+# GPU sentiment analysis
+curl -X POST https://api-def456.runpod.net/sentiment \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"text": "I love using Flash for my APIs"}'
+```
+
+Expected response:
+
+```json
+{
+ "text": "I love using Flash for my APIs",
+ "sentiment": {
+ "label": "POSITIVE",
+ "score": 0.9998
+ },
+ "device": "GPU"
+}
+```
+
+
+**Production path note**: In production, the endpoints use the exact paths defined in your route decorators (e.g., `/health`, `/sentiment`), without the worker file prefixes used in `flash run`.
+
+
+## Understanding the deployment architecture
+
+Your deployed API creates two independent Serverless endpoints:
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ Client([Client])
+
+ subgraph Runpod [RUNPOD SERVERLESS]
+        CPU["text-api endpoint<br/>CPU load balancer<br/>GET /health<br/>GET /info<br/>POST /analyze"]
+        GPU["gpu-sentiment endpoint<br/>GPU load balancer<br/>POST /sentiment"]
+ end
+
+ Client -->|HTTPS + Auth| CPU
+ Client -->|HTTPS + Auth| GPU
+
+ style Runpod fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px
+ style Client fill:#4D38F5,stroke:#4D38F5,color:#fff
+ style CPU fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style GPU fill:#22C55E,stroke:#22C55E,color:#000
+```
+
+**Key points:**
+- **CPU endpoint** (`text-api`) handles three routes on one Serverless endpoint
+- **GPU endpoint** (`gpu-sentiment`) handles GPU inference on a separate endpoint
+- Both endpoints scale independently based on load
+- All requests require authentication with your API key
+
+## Troubleshooting
+
+### Worker quota exceeded
+
+**Issue**: `Max workers across all endpoints must not exceed your workers quota (30)`
+
+**Solution**:
+1. Check your current worker usage in the [Runpod console](https://www.runpod.io/console/serverless)
+2. Reduce `workers` in your configuration:
+ ```python
+ api = Endpoint(
+ name="text-api",
+ cpu="cpu5c-4-8",
+ workers=(0, 1) # Reduce this value
+ )
+ ```
+3. Clean up unused endpoints before deploying new ones
+
+### GPU endpoint timeout
+
+**Issue**: Request times out after 60 seconds on first GPU endpoint call
+
+**Solutions**:
+1. This is normal for the first request; GPU provisioning takes time.
+2. Wait a few minutes and try again.
+3. Use `GpuGroup.ANY` instead of specific GPU types for better availability.
+4. Consider using CPU for development testing:
+ ```python
+ # For testing without GPU
+ api = Endpoint(name="sentiment-cpu", cpu="cpu5c-4-8")
+ ```
+
+### Port already in use
+
+**Issue**: `ERROR: [Errno 48] Address already in use` when running `flash run`
+
+**Solutions**:
+```bash
+# Use a different port
+flash run --port 8889
+
+# Or kill the process using port 8888
+lsof -ti:8888 | xargs kill -9
+```
+
+### Import errors in sentiment analysis
+
+**Issue**: `ModuleNotFoundError: No module named 'transformers'`
+
+**Solution**: Ensure dependencies are specified on the endpoint:
+
+```python
+gpu_api = Endpoint(
+ name="gpu-sentiment",
+ gpu=GpuGroup.ANY,
+ dependencies=["transformers", "torch"] # Must include these
+)
+```
+
+For local testing, install dependencies manually:
+```bash
+pip install transformers torch
+```
+
+### Endpoint stays in queue
+
+**Issue**: GPU sentiment route stays in `IN_QUEUE` status
+
+**Solutions**:
+1. Check [GPU availability](https://www.runpod.io/console/serverless) in console
+2. Use flexible GPU selection:
+ ```python
+ gpu=GpuGroup.ANY # Use any available GPU
+ ```
+3. Increase worker quota if at limit
+
+## Next steps
+
+Now that you've built a REST API with Flash, you can:
+
+### Add more routes
+
+Expand your API with additional functionality:
+
+```python
+@api.post("/summarize")
+async def summarize_text(text: str, max_length: int = 100) -> dict:
+ """Summarize long text."""
+ # Summarization logic
+ return {"summary": text[:max_length]}
+
+@api.post("/translate")
+async def translate_text(text: str, target_lang: str) -> dict:
+ """Translate text to another language."""
+ # Translation logic
+ return {"translated": text, "target": target_lang}
+```
+
+### Add authentication middleware
+
+Implement custom authentication for your API:
+
+```python
+@api.post("/protected")
+async def protected_route(text: str, api_key: str) -> dict:
+ """Route with custom authentication."""
+ if api_key != "your-secret-key":
+ return {"error": "Unauthorized"}, 401
+ return {"data": "protected content"}
+```
+
+### Monitor your API
+
+- Track endpoint health in the [Runpod console](https://www.runpod.io/console/serverless)
+- Monitor request counts and error rates
+- Adjust `workers` based on traffic patterns (see the sketch below)
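+
+For example, to raise the worker ceiling during peak traffic, change the `workers` range and redeploy with `flash deploy`. This is a minimal sketch; the exact range depends on your traffic and worker quota:
+
+```python
+from runpod_flash import Endpoint
+
+# Keep one warm worker and allow up to five during traffic spikes.
+api = Endpoint(
+    name="text-api",
+    cpu="cpu5c-4-8",
+    workers=(1, 5)
+)
+```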
+
+### Use multiple environments
+
+Deploy to different environments for testing:
+
+```bash
+flash deploy --env dev # Development
+flash deploy --env staging # Staging
+flash deploy --env production # Production
+```
+
+## Related resources
+
+- [Configuration reference](/flash/configuration/parameters)
+- [Endpoint functions guide](/flash/create-endpoints)
+- [Deploy Flash apps](/flash/apps/deploy-apps)
+- [Managing endpoints](/flash/managing-endpoints)
diff --git a/tutorials/flash/image-generation-with-sdxl.mdx b/tutorials/flash/image-generation-with-sdxl.mdx
new file mode 100644
index 00000000..08badd75
--- /dev/null
+++ b/tutorials/flash/image-generation-with-sdxl.mdx
@@ -0,0 +1,673 @@
+---
+title: "Generate images with Flash and SDXL"
+sidebarTitle: "Generate images with SDXL"
+description: "Learn how to use Flash with Stable Diffusion XL to generate high-quality images from text prompts."
+tag: "BETA"
+---
+
+This tutorial shows you how to build an image generation script using Flash and Stable Diffusion XL (SDXL). You'll learn how to load a pretrained diffusion model on a GPU worker and generate images from text prompts.
+
+
+
+
+
+## What you'll learn
+
+In this tutorial you'll learn how to:
+
+- Use the Hugging Face diffusers library with Flash.
+- Load and run Stable Diffusion XL models on GPU workers.
+- Generate high-quality images from text prompts.
+- Save generated images to disk.
+- Configure generation parameters like guidance scale and steps.
+
+## Requirements
+
+- You've [created a Runpod account](/get-started/manage-accounts).
+- You've [created a Runpod API key](/get-started/api-keys).
+- You've installed [Python 3.10 or higher](https://www.python.org/downloads/).
+- You've completed the [Flash quickstart](/flash/quickstart) or are familiar with Flash basics.
+
+## What you'll build
+
+By the end of this tutorial, you'll have a working image generation application that:
+
+- Accepts text prompts as input.
+- Generates photorealistic images using Stable Diffusion XL.
+- Runs entirely on Runpod's GPU infrastructure.
+- Saves generated images to your local machine.
+
+## Step 1: Set up your project
+
+Create a new directory for your project and set up a Python virtual environment:
+
+```bash
+mkdir flash-image-generation
+cd flash-image-generation
+```
+
+Install Flash and the other local dependencies for this tutorial using [uv](https://docs.astral.sh/uv/) (Pillow is used by the image-saving helper in Step 6):
+
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install runpod-flash python-dotenv pillow
+```
+
+Create a `.env` file with your Runpod API key:
+
+```bash
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+Replace `YOUR_API_KEY` with your actual API key from the [Runpod console](https://www.runpod.io/console/user/settings).
+
+## Step 2: Understand Stable Diffusion XL
+
+[Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) is a state-of-the-art text-to-image model from Stability AI. It offers:
+
+- **High-quality images**: Generates photorealistic 1024x1024 images
+- **Better prompt understanding**: Improved text comprehension compared to SD 1.5
+- **Fine details**: Enhanced rendering of hands, faces, and text
+- **Open source**: Available for free on Hugging Face
+
+SDXL requires significant GPU resources:
+- **Model size**: ~7GB of weights
+- **VRAM requirement**: Minimum 16GB (24GB recommended)
+- **Generation time**: 20-40 seconds per image on RTX 4090
+
+We'll use the [diffusers](https://huggingface.co/docs/diffusers/index) library from Hugging Face, which provides a clean Python API for Stable Diffusion models.
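+
+If you'd like to see the diffusers API on its own before wrapping it in Flash, here's a condensed, local-only sketch of the same calls used in Step 5. It assumes a local CUDA GPU with at least 16GB of VRAM and `diffusers`, `torch`, `transformers`, and `accelerate` installed locally:
+
+```python
+import torch
+from diffusers import StableDiffusionXLPipeline
+
+# Download SDXL and move it to the local GPU (half precision to save VRAM).
+pipe = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.float16,
+    use_safetensors=True,
+    variant="fp16",
+).to("cuda")
+
+# Generate a single 1024x1024 image from a text prompt.
+image = pipe(
+    prompt="A serene landscape with mountains, a lake, and sunset",
+    num_inference_steps=30,
+    guidance_scale=7.5,
+).images[0]
+image.save("local_test.png")
+```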
+
+## Step 3: Create your project file
+
+Create a new file called `image_generation.py`:
+
+```bash
+touch image_generation.py
+```
+
+Open this file in your code editor. The following steps walk through building the image generation application.
+
+## Step 4: Add imports and configuration
+
+Add the necessary imports and Flash configuration:
+
+```python
+import asyncio
+import base64
+from pathlib import Path
+from dotenv import load_dotenv
+from runpod_flash import Endpoint, GpuGroup
+
+# Load environment variables from .env file
+load_dotenv()
+```
+
+## Step 5: Define the image generation function
+
+Add the endpoint function that will run on the GPU worker:
+
+```python
+@Endpoint(
+ name="image-generation",
+ gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_24], # 24GB GPUs
+ workers=2,
+ idle_timeout=900, # Keep workers active for 15 minutes
+ dependencies=["diffusers", "torch", "transformers", "accelerate"]
+)
+def generate_image(prompt, negative_prompt="", num_steps=30, guidance_scale=7.5):
+ """Generate an image using Stable Diffusion XL."""
+ import torch
+ from diffusers import StableDiffusionXLPipeline
+ import base64
+ from io import BytesIO
+
+ # Load the SDXL model
+ model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+ pipe = StableDiffusionXLPipeline.from_pretrained(
+ model_id,
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16"
+ )
+
+ # Move model to GPU
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ pipe = pipe.to(device)
+
+ # Generate image
+ image = pipe(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ num_inference_steps=num_steps,
+ guidance_scale=guidance_scale,
+ height=1024,
+ width=1024
+ ).images[0]
+
+ # Convert image to base64 for transmission
+ buffered = BytesIO()
+ image.save(buffered, format="PNG")
+ img_str = base64.b64encode(buffered.getvalue()).decode()
+
+ return {
+ "image_base64": img_str,
+ "prompt": prompt,
+ "negative_prompt": negative_prompt,
+ "num_steps": num_steps,
+ "guidance_scale": guidance_scale,
+ "device": device,
+ "resolution": "1024x1024"
+ }
+```
+
+**Configuration breakdown**:
+
+- **`name="image-generation"`**: Identifies your endpoint in the Runpod console.
+- **`gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_24]`**: Uses 24GB GPUs such as the RTX 4090, L4, or A5000, all of which have enough VRAM for SDXL.
+- **`workers=2`**: Allows up to 2 parallel workers.
+- **`idle_timeout=900`**: Keeps workers active for 15 minutes (SDXL models are large, so we want longer caching).
+
+
+SDXL requires at least 16GB VRAM. Using 24GB GPUs provides comfortable headroom and faster generation.
+
+
+This function:
+- Loads the SDXL model from Hugging Face.
+- Moves the model to the GPU.
+- Generates an image from the prompt.
+- Encodes the image as base64.
+- Returns the image as a base64 string (and other metadata).
+
+Expand this section for a full breakdown:
+
+
+
+**Dependencies**: The function requires four packages:
+ - `diffusers`: Hugging Face library for diffusion models
+ - `torch`: PyTorch for GPU computation
+ - `transformers`: Text encoder dependencies
+ - `accelerate`: Efficient model loading
+
+**Model loading**:
+ ```python
+ pipe = StableDiffusionXLPipeline.from_pretrained(
+ model_id,
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16"
+ )
+ ```
+ This downloads SDXL from Hugging Face. Key parameters:
+ - `torch_dtype=torch.float16`: Use half-precision (saves VRAM, faster)
+ - `use_safetensors=True`: Use safe tensor format
+ - `variant="fp16"`: Download the fp16 version (~7GB instead of ~14GB)
+
+**GPU acceleration**:
+ ```python
+ pipe = pipe.to(device)
+ ```
+ Moves the entire pipeline (text encoder, UNet, VAE) to GPU.
+
+**Image generation**:
+ ```python
+ image = pipe(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ num_inference_steps=num_steps,
+ guidance_scale=guidance_scale,
+ height=1024,
+ width=1024
+ ).images[0]
+ ```
+
+ Parameters:
+ - **`prompt`**: What you want to see in the image
+ - **`negative_prompt`**: What you don't want (e.g., "blurry, low quality")
+ - **`num_inference_steps`**: More steps = better quality but slower (20-50 typical)
+ - **`guidance_scale`**: How closely to follow the prompt (7-10 recommended)
+ - **`height/width`**: SDXL is trained for 1024x1024
+
+**Image encoding**:
+ ```python
+ buffered = BytesIO()
+ image.save(buffered, format="PNG")
+ img_str = base64.b64encode(buffered.getvalue()).decode()
+ ```
+ We encode the image as base64 to return it through Flash. This allows us to transmit the image data as a string.
+
+
+
+## Step 6: Add the main function and image saving
+
+Create functions to call the generator and save images:
+
+```python
+def save_image(base64_string, filename):
+ """Save a base64-encoded image to disk."""
+ import base64
+ from PIL import Image
+ from io import BytesIO
+
+ # Decode base64 string
+ img_data = base64.b64decode(base64_string)
+
+ # Open and save image
+ image = Image.open(BytesIO(img_data))
+ image.save(filename)
+ print(f"✓ Image saved to {filename}")
+
+async def main():
+ print("Generating image with Stable Diffusion XL on Runpod GPU...")
+ print("This may take 1-2 minutes on first run (downloading model)...\n")
+
+ # Define your prompt
+ prompt = "A serene landscape with mountains, a lake, and sunset, highly detailed, photorealistic"
+ negative_prompt = "blurry, low quality, distorted, ugly"
+
+ # Generate image
+ result = await generate_image(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ num_steps=30,
+ guidance_scale=7.5
+ )
+
+ # Save the generated image
+ output_dir = Path("generated_images")
+ output_dir.mkdir(exist_ok=True)
+
+ filename = output_dir / "sdxl_output.png"
+ save_image(result["image_base64"], filename)
+
+ # Display metadata
+ print(f"\n{'='*60}")
+ print("GENERATION DETAILS")
+ print('='*60)
+ print(f"Prompt: {result['prompt']}")
+ print(f"Negative prompt: {result['negative_prompt']}")
+ print(f"Steps: {result['num_steps']}")
+ print(f"Guidance scale: {result['guidance_scale']}")
+ print(f"Resolution: {result['resolution']}")
+ print(f"Device: {result['device']}")
+ print('='*60)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+This main function:
+
+- Calls the remote function with `await`.
+- Creates a `generated_images` directory if it doesn't exist.
+- Decodes and saves the base64 image to disk.
+- Displays generation metadata.
+
+## Step 7: Run your first generation
+
+Run the application:
+
+```bash
+python image_generation.py
+```
+
+**First run output** (takes 2-3 minutes):
+
+```text
+Generating image with Stable Diffusion XL on Runpod GPU...
+This may take 1-2 minutes on first run (downloading model)...
+
+Creating endpoint: server_Endpoint_a1b2c3d4
+Provisioning Serverless endpoint...
+Endpoint ready
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Downloading model weights from Hugging Face...
+Model loaded, generating image...
+Job completed, output received
+✓ Image saved to generated_images/sdxl_output.png
+
+============================================================
+GENERATION DETAILS
+============================================================
+Prompt: A serene landscape with mountains, a lake, and sunset, highly detailed, photorealistic
+Negative prompt: blurry, low quality, distorted, ugly
+Steps: 30
+Guidance scale: 7.5
+Resolution: 1024x1024
+Device: cuda
+============================================================
+```
+
+**Subsequent runs** (takes 30-40 seconds):
+
+```text
+Generating image with Stable Diffusion XL on Runpod GPU...
+
+Resource Endpoint_a1b2c3d4 already exists, reusing.
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Job completed, output received
+✓ Image saved to generated_images/sdxl_output.png
+
+[Results appear]
+```
+
+Open `generated_images/sdxl_output.png` to see your generated image!
+
+
+The first run downloads ~7GB of model weights, which takes 1-2 minutes. Subsequent runs reuse the cached model and complete in 30-40 seconds.
+
+
+## Step 8: Experiment with different prompts
+
+Try various prompts to see SDXL's capabilities:
+
+```python
+async def main():
+ # Create output directory
+ output_dir = Path("generated_images")
+ output_dir.mkdir(exist_ok=True)
+
+ # Try different prompts
+ prompts = [
+ {
+ "prompt": "A cyberpunk city at night with neon lights, flying cars, rain, cinematic",
+ "negative": "blurry, low quality",
+ "filename": "cyberpunk_city.png"
+ },
+ {
+ "prompt": "A cute corgi puppy wearing a space suit, floating in space, highly detailed",
+ "negative": "distorted, ugly, bad anatomy",
+ "filename": "space_corgi.png"
+ },
+ {
+ "prompt": "An ancient wizard's study filled with books, potions, magical artifacts, candlelight",
+ "negative": "blurry, modern, plastic",
+ "filename": "wizard_study.png"
+ }
+ ]
+
+ for i, p in enumerate(prompts, 1):
+ print(f"\n{'='*60}")
+ print(f"Generating image {i}/{len(prompts)}")
+ print(f"Prompt: {p['prompt'][:50]}...")
+ print('='*60)
+
+ result = await generate_image(
+ prompt=p['prompt'],
+ negative_prompt=p['negative'],
+ num_steps=30,
+ guidance_scale=7.5
+ )
+
+ filename = output_dir / p['filename']
+ save_image(result["image_base64"], filename)
+ print(f"✓ Saved to {filename}\n")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+Run it:
+
+```bash
+python image_generation.py
+```
+
+You'll see three different images generated sequentially on the same GPU worker. Each generation takes about 30-40 seconds after the first one.
+
+## Understanding generation parameters
+
+Let's explore how different parameters affect image quality:
+
+### Number of inference steps
+
+```python
+# Fast but lower quality (15-20 steps)
+result = await generate_image(prompt, num_steps=20)
+
+# Balanced (30 steps) - recommended
+result = await generate_image(prompt, num_steps=30)
+
+# High quality but slower (50 steps)
+result = await generate_image(prompt, num_steps=50)
+```
+
+**Effects**:
+- **15-20 steps**: Faster (15-20 seconds) but less refined details.
+- **30 steps**: Good balance of quality and speed (30-40 seconds) - **recommended**.
+- **50+ steps**: Diminishing returns, minimal quality improvement.
+
+### Guidance scale
+
+```python
+# Low guidance - more creative, less faithful to prompt
+result = await generate_image(prompt, guidance_scale=5.0)
+
+# Medium guidance - balanced (recommended)
+result = await generate_image(prompt, guidance_scale=7.5)
+
+# High guidance - very faithful to prompt, may oversaturate
+result = await generate_image(prompt, guidance_scale=12.0)
+```
+
+**Effects**:
+- **3-5**: More artistic freedom, less literal interpretation.
+- **7-10**: Balanced, follows prompt closely - **recommended**.
+- **12+**: Very literal, may produce oversaturated or exaggerated images.
+
+### Negative prompts
+
+Negative prompts tell the model what to avoid:
+
+```python
+# Good negative prompts for photorealistic images
+negative_prompt = "blurry, low quality, distorted, ugly, bad anatomy, watermark"
+
+# Good negative prompts for artistic images
+negative_prompt = "realistic, photograph, blurry, low quality"
+
+# Good negative prompts for portraits
+negative_prompt = "distorted face, bad anatomy, extra limbs, low quality"
+```
+
+Use negative prompts to:
+
+- Remove common artifacts ("distorted", "low quality").
+- Avoid unwanted styles ("cartoon", "3D render").
+- Fix common issues ("bad anatomy", "extra fingers").
+
+## Troubleshooting
+
+### Out of memory error
+
+**Issue**: `RuntimeError: CUDA out of memory`.
+
+**Cause**: SDXL requires significant VRAM (16GB minimum).
+
+**Solutions**:
+1. Verify you're using 24GB GPUs:
+ ```python
+ gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_24] # 24GB GPUs
+ ```
+
+2. Use half-precision (already in the example):
+ ```python
+ torch_dtype=torch.float16 # Half precision
+ ```
+
+3. If still failing, use 48GB GPUs:
+ ```python
+ gpu=GpuGroup.AMPERE_48 # A40/A6000 with 48GB
+ ```
+
+### Model download fails
+
+**Issue**: `Error: Failed to download model from Hugging Face`.
+
+**Solutions**:
+1. Increase execution timeout for first run:
+ ```python
+ @Endpoint(
+ name="image-generation",
+ gpu=GpuGroup.ADA_24,
+ execution_timeout_ms=600000 # 10 minutes for first download
+ )
+ ```
+
+2. Check Hugging Face Hub status at [status.huggingface.co](https://status.huggingface.co).
+
+3. Try a smaller model first to test connectivity:
+ ```python
+   model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5" # Smaller SD 1.5
+ ```
+
+### Image quality is poor
+
+**Issue**: Generated images look blurry or low quality.
+
+**Solutions**:
+1. Increase inference steps:
+ ```python
+ num_steps=40 # More steps = better quality
+ ```
+
+2. Adjust guidance scale:
+ ```python
+ guidance_scale=8.5 # Higher guidance
+ ```
+
+3. Improve your prompt:
+ ```python
+ prompt = "A detailed portrait, highly detailed, sharp focus, 8k, professional photography"
+ ```
+
+4. Add quality keywords to your prompt:
+ - "highly detailed"
+ - "sharp focus"
+ - "8k"
+ - "photorealistic"
+ - "professional"
+
+### Slow generation
+
+**Issue**: Image generation takes >60 seconds per image.
+
+**Possible causes**:
+1. Worker scaled down (cold start).
+2. Model not cached.
+3. Too many inference steps.
+
+**Solutions**:
+1. Increase `idle_timeout` to keep workers active:
+ ```python
+ idle_timeout=1800 # Keep active for 30 minutes
+ ```
+
+2. Reduce inference steps:
+ ```python
+ num_steps=20 # Faster but slightly lower quality
+ ```
+
+3. Set `workers=(1, 2)` to always have a warm worker ready.
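+
+   As a sketch, here's the same decorator with a warm-worker floor and a longer idle timeout (endpoint name, GPU group, and dependencies unchanged from the earlier steps):
+
+   ```python
+   @Endpoint(
+       name="image-generation",
+       gpu=GpuGroup.ADA_24,
+       workers=(1, 2),      # keep at least 1 warm worker, scale up to 2
+       idle_timeout=1800,   # keep workers active for 30 minutes after the last request
+       dependencies=["diffusers", "torch", "transformers", "accelerate"]
+   )
+   def generate_image(prompt, negative_prompt="", num_steps=30, guidance_scale=7.5):
+       ...  # same body as before
+   ```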
+
+### Images look distorted or have artifacts
+
+**Issue**: Generated images have weird artifacts or distortions.
+
+**Solutions**:
+1. Use negative prompts:
+ ```python
+ negative_prompt="distorted, ugly, bad anatomy, extra limbs, disfigured"
+ ```
+
+2. Adjust guidance scale (try 7-9 range):
+ ```python
+ guidance_scale=8.0
+ ```
+
+3. Increase inference steps for better refinement:
+ ```python
+ num_steps=35
+ ```
+
+## Next steps
+
+Now that you've built an image generation script with Flash, you can:
+
+### Try other Stable Diffusion models
+
+Explore different models from Hugging Face:
+
+```python
+# SDXL Turbo - distilled for 1-step generation, much faster
+model_id = "stabilityai/sdxl-turbo"
+
+# Stable Diffusion 1.5 - smaller, faster
+model_id = "runwayml/stable-diffusion-v1-5"
+
+# Stable Diffusion 2.1 - better at artistic styles
+model_id = "stabilityai/stable-diffusion-2-1"
+```
+
+### Add image-to-image generation
+
+Use an existing image as a starting point:
+
+```python
+from diffusers import StableDiffusionXLImg2ImgPipeline
+
+# Load img2img pipeline
+pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(...)
+
+# Generate variations of an existing image
+image = pipe(prompt, image=init_image, strength=0.75).images[0]
+```
+
+### Build a Flash app
+
+Convert your script to a production [Flash app](/flash/apps/overview):
+
+```bash
+flash init image-generation-app
+# Move your function to workers/gpu/endpoint.py
+# Add FastAPI routes for HTTP API
+flash deploy
+```
+
+### Optimize with network volumes
+
+Use [network volumes](/flash/configuration/storage) to cache models across workers:
+
+```python
+from runpod_flash import Endpoint, GpuGroup, NetworkVolume
+
+vol = NetworkVolume(name="model-cache") # Finds existing or creates new
+
+@Endpoint(
+ name="image-generation",
+ gpu=GpuGroup.ADA_24,
+ volume=vol,
+ dependencies=["diffusers", "torch", "transformers", "accelerate"]
+)
+def generate_image(prompt, ...):
+ # Models at /runpod-volume/ persist across workers
+ ...
+```
+
+### Explore advanced features
+
+- **LoRA fine-tuning**: Customize SDXL for specific styles.
+- **ControlNet**: Guide generation with edge maps, depth, or pose.
+- **Inpainting**: Edit specific parts of images (see the sketch below).
+- **Upscaling**: Generate higher resolution images.
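+
+As one example, here's a hedged inpainting sketch using the diffusers `AutoPipelineForInpainting` class. The checkpoint ID and the image and mask file names are illustrative assumptions rather than part of this tutorial's code, and the snippet would run inside an endpoint function like `generate_image` above:
+
+```python
+import torch
+from diffusers import AutoPipelineForInpainting
+from PIL import Image
+
+# Load an SDXL inpainting checkpoint (assumed checkpoint ID; check the Hub for alternatives)
+pipe = AutoPipelineForInpainting.from_pretrained(
+    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
+    torch_dtype=torch.float16
+).to("cuda")
+
+init_image = Image.open("generated_images/sdxl_output.png").convert("RGB")
+mask = Image.open("mask.png").convert("RGB")  # white areas are regenerated, black areas are kept
+
+result = pipe(
+    prompt="A stained glass window",
+    image=init_image,
+    mask_image=mask,
+    strength=0.85
+).images[0]
+result.save("generated_images/inpainted.png")
+```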
+
+## Related resources
+
+- [Endpoint functions guide](/flash/create-endpoints).
+- [Configuration reference](/flash/configuration/parameters).
+- [Managing Flash endpoints](/flash/managing-endpoints).
+- [Hugging Face diffusers documentation](https://huggingface.co/docs/diffusers/index).
+- [Stable Diffusion XL model card](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).
+- [Prompt engineering guide](https://huggingface.co/docs/diffusers/using-diffusers/write_good_prompt).
diff --git a/tutorials/flash/text-generation-with-transformers.mdx b/tutorials/flash/text-generation-with-transformers.mdx
new file mode 100644
index 00000000..e8166046
--- /dev/null
+++ b/tutorials/flash/text-generation-with-transformers.mdx
@@ -0,0 +1,467 @@
+---
+title: "Generate text with Flash and transformers"
+sidebarTitle: "Generate text with transformers"
+description: "Learn how to use Flash with Hugging Face transformers to build a GPU-accelerated text generation application."
+tag: "BETA"
+---
+
+This tutorial shows you how to build a text generation script using Flash and Hugging Face's transformers library. You'll learn how to load a pretrained language model on a GPU worker and generate text from prompts.
+
+## What you'll learn
+
+In this tutorial you'll learn how to:
+
+- Install and use the Hugging Face transformers library with Flash.
+- Load pretrained models on remote GPU workers.
+- Move models to GPU for faster inference.
+- Configure text generation parameters like temperature and max length.
+- Return structured results with metadata.
+
+## Requirements
+
+- You've [created a Runpod account](/get-started/manage-accounts).
+- You've [created a Runpod API key](/get-started/api-keys).
+- You've installed [Python 3.10 or higher](https://www.python.org/downloads/).
+- You've completed the [Flash quickstart](/flash/quickstart) or are familiar with Flash basics.
+
+## What you'll build
+
+By the end of this tutorial, you'll have a working text generation application that:
+
+- Accepts text prompts as input.
+- Generates natural language completions using GPT-2.
+- Runs entirely on Runpod's GPU infrastructure.
+- Returns generated text with execution metadata.
+
+## Step 1: Set up your project
+
+Create a new directory for your project and set up a Python virtual environment:
+
+```bash
+mkdir flash-text-generation
+cd flash-text-generation
+```
+
+Install Flash using [uv](https://docs.astral.sh/uv/):
+
+```bash
+uv venv
+source .venv/bin/activate
+uv pip install runpod-flash python-dotenv
+```
+
+Create a `.env` file with your Runpod API key:
+
+```bash
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+Replace `YOUR_API_KEY` with your actual API key from the [Runpod console](https://www.runpod.io/console/user/settings).
+
+## Step 2: Understand the Hugging Face transformers library
+
+[Hugging Face transformers](https://huggingface.co/docs/transformers/index) is a popular Python library for working with pretrained language models. It provides:
+
+- **Thousands of pretrained models**: GPT-2, BERT, T5, Llama, and more
+- **Unified API**: Same code works across different model architectures
+- **Model hub integration**: Download models directly from [Hugging Face Hub](https://huggingface.co/models)
+- **Production-ready**: Used by companies and researchers worldwide
+
+For this tutorial, we'll use **GPT-2**, a 124M parameter language model from OpenAI. It's small enough to load quickly but powerful enough to generate coherent text.
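+
+As a quick illustration of the unified API (a minimal local sketch, no Flash or GPU required), the high-level `pipeline` helper wraps model loading, tokenization, and generation in a single call:
+
+```python
+from transformers import pipeline
+
+# Downloads GPT-2 (~500MB) on first use, then generates a short continuation on CPU
+generator = pipeline("text-generation", model="gpt2")
+result = generator("The future of artificial intelligence is", max_new_tokens=20)
+print(result[0]["generated_text"])
+```
+
+In the rest of this tutorial we call the lower-level `AutoTokenizer` and `AutoModelForCausalLM` classes directly, which gives finer control over device placement and generation parameters.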
+
+## Step 3: Create your project file
+
+Create a new file called `text_generation.py`:
+
+```bash
+touch text_generation.py
+```
+
+Open this file in your code editor. The following steps walk through building the text generation application.
+
+## Step 4: Add imports and configuration
+
+Add the necessary imports and Flash configuration:
+
+```python
+import asyncio
+from dotenv import load_dotenv
+from runpod_flash import Endpoint, GpuGroup
+
+# Load environment variables from .env file
+load_dotenv()
+```
+
+## Step 5: Define the text generation function
+
+Add the endpoint function that will run on the GPU worker:
+
+```python
+@Endpoint(
+ name="text-generation",
+ gpu=[GpuGroup.AMPERE_24, GpuGroup.ADA_24], # 24GB GPUs
+ workers=3,
+ idle_timeout=600, # 10 minutes
+ dependencies=["transformers", "torch", "accelerate"]
+)
+def generate_text(prompt, max_length=50):
+ """Generate text using a pretrained language model."""
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load the GPT-2 model and tokenizer
+ model_name = "gpt2"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Move model to GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ device_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
+ model = model.to(device)
+
+ # Tokenize the input prompt
+ inputs = tokenizer(prompt, return_tensors="pt").to(device)
+
+ # Generate text
+ with torch.no_grad():
+ outputs = model.generate(
+ **inputs,
+ max_length=max_length,
+ num_return_sequences=1,
+ temperature=0.7,
+ do_sample=True,
+ pad_token_id=tokenizer.eos_token_id
+ )
+
+ # Decode the generated tokens back to text
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ return {
+ "prompt": prompt,
+ "generated_text": generated_text,
+ "model_name": model_name,
+ "device": device,
+ "device_name": device_name,
+ "max_length": max_length
+ }
+```
+
+**Configuration breakdown**:
+
+- **`name="text-generation"`**: Identifies your endpoint in the Runpod console
+- **`gpu=[GpuGroup.AMPERE_24, GpuGroup.ADA_24]`**: Allows workers to use L4, A5000, RTX 3090, or RTX 4090 GPUs (all have 24GB VRAM)
+- **`workers=3`**: Allows up to 3 parallel workers for concurrent requests
+- **`idle_timeout=600`**: Keeps workers active for 10 minutes after last use (reduces cold starts)
+
+
+GPT-2 only requires about 2GB of VRAM, so 24GB GPUs are more than sufficient. For larger models like LLaMA or GPT-J, you might need 48GB or 80GB GPUs.
+
+
+This function:
+- Loads the GPT-2 model from Hugging Face.
+- Moves the model to the GPU.
+- Tokenizes the input prompt.
+- Generates text from the prompt.
+- Decodes the generated tokens back to text.
+- Returns the generated text and other metadata.
+
+Expand this section for a full breakdown:
+
+
+
+**Dependencies**: The function requires three packages:
+ - `transformers`: Hugging Face library for language models
+ - `torch`: PyTorch for GPU computation
+ - `accelerate`: Helper library for loading large models efficiently
+
+**Model loading**:
+ ```python
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+ ```
+ These lines download and load the GPT-2 model from Hugging Face Hub. The first time this runs, it downloads ~500MB of model weights. Subsequent runs use the cached version.
+
+**GPU acceleration**:
+ ```python
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device)
+ ```
+ This moves the model to GPU for faster inference. On Runpod workers, `torch.cuda.is_available()` returns `True`.
+
+**Tokenization**:
+ ```python
+ inputs = tokenizer(prompt, return_tensors="pt").to(device)
+ ```
+ Converts your text prompt into token IDs that the model understands. The `.to(device)` moves these tokens to GPU memory.
+
+**Generation parameters**:
+ - `max_length=50`: Maximum total length in tokens, counting both the prompt and the generated continuation
+ - `temperature=0.7`: Controls randomness (0.0 = deterministic, 1.0+ = very random)
+ - `do_sample=True`: Use sampling instead of greedy decoding for more diverse outputs
+ - `num_return_sequences=1`: Generate one completion per prompt
+
+**No gradient tracking**:
+ ```python
+ with torch.no_grad():
+ ```
+ Disables gradient computation, reducing memory usage and speeding up inference.
+
+
+
+## Step 6: Add the main function
+
+Create the main function to test your text generator:
+
+```python
+async def main():
+ print("Starting text generation on Runpod GPU...")
+
+ # Define a prompt
+ prompt = "The future of artificial intelligence is"
+
+ # Generate text
+ result = await generate_text(prompt, max_length=100)
+
+ # Display results
+ print("\n" + "="*60)
+ print("TEXT GENERATION RESULTS")
+ print("="*60)
+ print(f"\nPrompt: {result['prompt']}")
+ print(f"\nGenerated text:\n{result['generated_text']}")
+ print("\n" + "-"*60)
+ print(f"Model: {result['model_name']}")
+ print(f"Device: {result['device']}")
+ print(f"GPU: {result['device_name']}")
+ print(f"Max length: {result['max_length']} tokens")
+ print("="*60)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+This main function:
+
+- Calls the remote function with `await` (runs asynchronously).
+- Waits for the GPU worker to complete text generation.
+- Displays the results in a formatted output.
+
+## Step 7: Run your first generation
+
+Run the application:
+
+```bash
+python text_generation.py
+```
+
+**First run output** (takes 60-90 seconds):
+
+```text
+Starting text generation on Runpod GPU...
+Creating endpoint: server_Endpoint_a1b2c3d4
+Provisioning Serverless endpoint...
+Endpoint ready
+Registering RunPod endpoint at https://api.runpod.ai/xvf32dan8rcilp
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Installing dependencies: transformers torch accelerate
+Downloading model weights...
+Job completed, output received
+
+============================================================
+TEXT GENERATION RESULTS
+============================================================
+
+Prompt: The future of artificial intelligence is
+
+Generated text:
+The future of artificial intelligence is bright and full of possibilities. With advancements in machine learning and deep learning, we're seeing AI systems that can understand natural language, recognize images, and even create art. The potential applications are endless, from healthcare to transportation to education.
+
+------------------------------------------------------------
+Model: gpt2
+Device: cuda
+GPU: NVIDIA GeForce RTX 4090
+Max length: 100 tokens
+============================================================
+```
+
+**Subsequent runs** (takes 2-5 seconds):
+
+```text
+Starting text generation on Runpod GPU...
+Resource Endpoint_a1b2c3d4 already exists, reusing.
+Registering RunPod endpoint at https://api.runpod.ai/xvf32dan8rcilp
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Job completed, output received
+
+[Results appear immediately]
+```
+
+Notice the dramatic speed improvement on subsequent runs—the endpoint is already provisioned, dependencies are installed, and the model is cached.
+
+## Step 8: Experiment with different prompts
+
+Modify the main function to try different prompts:
+
+```python
+async def main():
+ print("Starting text generation on Runpod GPU...")
+
+ # Try multiple prompts
+ prompts = [
+ "Once upon a time in a distant galaxy",
+ "The secret to happiness is",
+ "In the year 2050, technology will"
+ ]
+
+ for prompt in prompts:
+ print(f"\n{'='*60}")
+ print(f"Generating for: {prompt}")
+ print('='*60)
+
+ result = await generate_text(prompt, max_length=80)
+ print(f"\n{result['generated_text']}\n")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+Run it again:
+
+```bash
+python text_generation.py
+```
+
+You'll see three different completions generated sequentially on the same GPU worker.
+
+## Troubleshooting
+
+### Model download fails
+
+**Issue**: `Error: Failed to download model from Hugging Face`.
+
+**Solutions**:
+1. Check internet connectivity from workers (rare issue on Runpod).
+2. Try a smaller model first to confirm the worker can download from Hugging Face Hub.
+3. Increase execution timeout in configuration:
+ ```python
+ @Endpoint(
+ name="text-generation",
+ gpu=GpuGroup.ADA_24,
+ execution_timeout_ms=300000 # 5 minutes
+ )
+ ```
+
+### Out of memory error
+
+**Issue**: `RuntimeError: CUDA out of memory`.
+
+**Solutions**:
+1. Use smaller models (GPT-2 instead of GPT-2 Large).
+2. Reduce `max_length` parameter.
+3. Use larger GPUs:
+ ```python
+ gpu=GpuGroup.AMPERE_48 # 48GB GPUs
+ ```
+
+### Slow generation
+
+**Issue**: Text generation takes >30 seconds per request.
+
+**Possible causes**:
+1. Worker scaled down (cold start).
+2. Model not cached.
+3. Large `max_length` value.
+
+**Solutions**:
+1. Increase `idle_timeout` to keep workers active:
+ ```python
+ idle_timeout=1800 # Keep active for 30 minutes
+ ```
+2. Set `workers=(1, 3)` to always have a warm worker ready.
+3. Reduce `max_length` to generate fewer tokens.
+
+### Generation quality is poor
+
+**Issue**: Generated text is incoherent or repetitive.
+
+**Solutions**:
+1. Adjust `temperature` (try 0.7-0.9).
+2. Add `top_p` and `top_k` sampling:
+ ```python
+ outputs = model.generate(
+ **inputs,
+ max_length=max_length,
+ temperature=0.8,
+ top_p=0.9,
+ top_k=50,
+ do_sample=True
+ )
+ ```
+3. Try a larger model (GPT-2 Medium or Large).
+
+## Next steps
+
+Now that you've built a text generation script with Flash, you can:
+
+### Explore other models
+
+Try different models from Hugging Face:
+
+```python
+# Larger general-purpose language model
+model_name = "facebook/opt-1.3b"
+
+# Code generation model
+model_name = "Salesforce/codegen-350M-mono"
+
+# Dialogue model
+model_name = "microsoft/DialoGPT-medium"
+```
+
+### Build a chat interface
+
+Extend your app to handle multi-turn conversations. The following is a minimal runnable sketch (GPT-2 is a placeholder; swap in a dialogue model such as `microsoft/DialoGPT-medium`):
+
+```python
+@Endpoint(
+    name="chat",
+    gpu=GpuGroup.ADA_24,
+    dependencies=["transformers", "torch"]
+)
+def chat(conversation_history, max_new_tokens=60):
+    """Multi-turn chat with context."""
+    import torch
+    from transformers import AutoTokenizer, AutoModelForCausalLM
+
+    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
+    model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
+
+    # Concatenate conversation history into a single prompt
+    prompt = "\n".join(conversation_history)
+    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+
+    # Generate a response and return only the newly generated tokens
+    with torch.no_grad():
+        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True,
+                                 temperature=0.8, pad_token_id=tokenizer.eos_token_id)
+    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+```
+
+### Deploy as a Flash app
+
+Convert your script to a production [Flash app](/flash/apps/overview):
+
+```bash
+flash init text-generation-app
+# Move your function to workers/gpu/endpoint.py
+# Add FastAPI routes
+flash deploy
+```
+
+
+When deploying queue-based functions with `flash deploy`, each function must have its own unique endpoint configuration. If your script has multiple functions sharing the same config (like `generate_text` and `chat` in this tutorial), create separate endpoints for each function when converting to a Flash app. See [understanding endpoint architecture](/flash/apps/deploy-apps#understanding-endpoint-architecture) for details.
+
+
+### Optimize performance
+
+- Use [network volumes](/flash/configuration/storage) to cache models across workers.
+- Implement [request batching](/flash/create-endpoints#parallel-execution) for higher throughput.
+- Try [quantized models](https://huggingface.co/docs/transformers/main_classes/quantization) for faster inference.
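+
+For example, here's a hedged 8-bit quantization sketch using the transformers `BitsAndBytesConfig` API. It assumes you add `bitsandbytes` (plus `accelerate`) to the endpoint's `dependencies`, and the model ID is only an example:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+# 8-bit weights roughly halve GPU memory use compared to float16
+quant_config = BitsAndBytesConfig(load_in_8bit=True)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "facebook/opt-1.3b",              # example model; swap in your own
+    quantization_config=quant_config,
+    device_map="auto",                # requires the accelerate package
+)
+tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
+```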
+
+## Related resources
+
+- [Endpoint functions guide](/flash/create-endpoints).
+- [Configuration reference](/flash/configuration/parameters).
+- [Managing Flash endpoints](/flash/managing-endpoints).
+- [Hugging Face transformers documentation](https://huggingface.co/docs/transformers/index).
+- [Hugging Face model hub](https://huggingface.co/models).
diff --git a/tutorials/introduction/overview.mdx b/tutorials/introduction/overview.mdx
index eff9dee7..cfdbcd15 100644
--- a/tutorials/introduction/overview.mdx
+++ b/tutorials/introduction/overview.mdx
@@ -2,53 +2,69 @@
title: "Overview"
sidebarTitle: "Overview"
description: "Step-by-step guides for building and deploying AI/ML applications on Runpod."
+mode: "wide"
---
+
+
This section includes step-by-step guides to help you build and deploy example applications on the Runpod platform, covering basic concepts and advanced implementations.
## Serverless
-
-
+
+
Deploy a Stable Diffusion endpoint and generate your first AI image using Serverless.
-
+
Deploy an image generation endpoint and integrate it into a web application.
-
+
Learn how to create a custom Serverless endpoint that uses model caching to serve a large language model with reduced cost and cold start times.
-
+
Deploy a Serverless endpoint with Google's Gemma 3 model using vLLM and the OpenAI API to build an interactive chatbot.
-
+
Deploy ComfyUI on Serverless and generate images using JSON workflows.
+## Flash
+
+
+ Deploy SDXL as a serverless endpoint with Python decorators.
+
+
+ Deploy a text generation model on Runpod.
+
+
+ Create a REST API with automatic load balancing using Flash.
+
+
+
## Pods
-
-
+
+
Launch JupyterLab on a GPU Pod and run LLM inference using the Python `transformers` library.
-
+
Deploy Ollama on a GPU Pod and run LLM inference using the Ollama API.
-
+
Build Docker images on Pods using Bazel.
-
+
Deploy ComfyUI on a GPU Pod and generate images using the ComfyUI web interface.
## Public Endpoints
-
-
+
+
Chain multiple Public Endpoints to generate videos from text prompts using Python.
-
+
\ No newline at end of file
diff --git a/tutorials/serverless/model-caching-text.mdx b/tutorials/serverless/model-caching-text.mdx
index a9c68975..56a50e99 100644
--- a/tutorials/serverless/model-caching-text.mdx
+++ b/tutorials/serverless/model-caching-text.mdx
@@ -462,16 +462,16 @@ Now that you have a working Phi-3 endpoint with cached models, you can:
## Related resources
-
+
Learn more about cached models and their benefits
-
+
Deploy workers directly from GitHub repositories
-
+
Understand handler function structure and best practices
-
+
Explore vLLM for optimized LLM inference