Result: SGLang wins decisively — 3.9x throughput, 2.3x faster ITL, 52% less tail latency, 29% less peak GPU memory, 3.4x faster startup. Zero errors on both sides.
I went through the ART (Agent Reinforcement Trainer) codebase by OpenPipe to understand how it handles the inference-training lifecycle for reinforcement learning. ART uses vLLM as its inference engine and Megatron for gradient computation, switching between them via a sleep/wake mechanism.
After reading through the key files — src/art/megatron/backend.py, src/art/megatron/service.py, and src/art/unsloth/service.py — I identified the following core flow:
- Rollout: vLLM generates completions for a batch of prompts (via in-process
AsyncLLMPython API). - Sleep:
do_sleep(level=2)releases vLLM's KV cache and model weights from GPU using CUDA VMMunmap_and_release(). - Train: Megatron subprocess runs a training step (LoRA + optimizer), producing updated adapter weights.
- Wake:
do_wake_up()restores weights and KV cache back to GPU viacreate_and_map(). - Weight sync: Updated LoRA adapter is loaded via in-process
add_lora().
The critical observation: vLLM and Megatron share the same CUDA context because vLLM runs in-process. When vLLM wakes up, Megatron is still alive and holding GPU memory — so vLLM gets a smaller KV cache pool than it originally had. This is a known issue (vLLM RFC #15254, 17 thumbs-up from core devs).
I built an alternative backend that replaces vLLM with SGLang, following the verl (Volcano Engine RL) integration pattern. The key design decision: run SGLang as a separate process with its own CUDA context, communicating via HTTP instead of in-process Python calls.
The implementation consists of three main files:
| File | Purpose |
|---|---|
sglang_server.py |
Server lifecycle management (start once, sleep/wake via HTTP, LoRA hot-reload) |
sglang_megatron_service.py |
The sleep → train → wake → load_lora lifecycle |
sglang_megatron_backend.py |
Backend class that inherits from ART's LocalBackend |
The lifecycle I implemented:
- Rollout: SGLang generates completions via OpenAI-compatible HTTP API.
- Sleep: HTTP
/release_memory_occupation— SGLang frees GPU memory at the OS/driver level. Since it's a separate process, this is a clean release. - Train: Same Megatron subprocess as ART — identical code, identical optimizer. Now it gets the full GPU because SGLang truly freed its memory.
- Wake: HTTP
/resume_memory_occupation— SGLang re-allocates based on what's actually free. - Weight sync: HTTP
/load_lora_adapterloads the ~2 MB LoRA adapter in <2s. I also wrote the TP-shard merging logic (_merge_lora_shards) to correctly handle column-parallel vs row-parallel layers.
I took inspiration from the verl project for the server-starts-once, sleep/wake via HTTP, LoRA hot-reload pattern.
┌─────────────────────────────────────┐ ┌─────────────────────────────────────┐
│ ART's vLLM (what I studied) │ │ My SGLang Backend │
│ │ │ │
│ ┌─ Single Process ───────────────┐ │ │ ┌─ Process 1 ──────────────────┐ │
│ │ Shared CUDA Context │ │ │ │ Independent CUDA Context │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────────────────────┐ │ │ │ │ ┌────────────────────────┐ │ │
│ │ │ vLLM AsyncLLM Engine │ │ │ │ │ │ SGLang Server │ │ │
│ │ │ (in-process Python API) │ │ │ │ │ │ (HTTP API, persistent)│ │ │
│ │ └──────────────────────────┘ │ │ │ │ └────────────────────────┘ │ │
│ │ ┌────────────┐ ┌───────────┐ │ │ │ │ ┌────────────┐ ┌────────┐ │ │
│ │ │ KV Cache │ │ Weights │ │ │ │ │ │ RadixAttn │ │ LoRA │ │ │
│ │ │ CUDA VMM │ │ GPU │ │ │ │ │ │ KV Cache │ │ <2s │ │ │
│ │ └────────────┘ └───────────┘ │ │ │ │ └────────────┘ └────────┘ │ │
│ │ │ │ │ └──────────────────────────────┘ │
│ │ ┌──────────────────────────┐ │ │ │ │
│ │ │ Megatron Subprocess │ │ │ │ ┌─ Process 2 ──────────────────┐ │
│ │ │ (stays alive, holds GPU)│ │ │ │ │ Megatron Training │ │
│ │ │ LoRA + Optimizer States │ │ │ │ │ (gets full GPU after sleep) │ │
│ │ └──────────────────────────┘ │ │ │ │ LoRA + Optimizer States │ │
│ └────────────────────────────────┘ │ │ └──────────────────────────────┘ │
│ │ │ │
│ ▲ 53 GB lost after 1st cycle │ │ ✓ Full memory recovery every step │
└─────────────────────────────────────┘ └─────────────────────────────────────┘
| Aspect | ART's vLLM | My SGLang Backend |
|---|---|---|
| Process model | In-process | Separate process |
| Sleep/wake | do_sleep(level=2) |
HTTP /release_memory |
| Memory recovery | 53 GB lost permanently | Full recovery each step |
| Weight sync | In-process add_lora() |
HTTP LoRA hot-reload (<2s) |
| KV cache | Standard prefix cache | RadixAttention |
| Startup | ~182s | ~53s (3.4x faster) |
| Training engine | Megatron (identical — same code on both) | Megatron (identical — same code on both) |
After the first sleep/wake cycle, ART's vLLM loses ~53 GB of GPU memory and never gets it back. I verified this by monitoring nvidia-smi across steps:
Step 1: 190.4 GB → Step 2: 136.6 GB → Step 10: 139.1 GB
This is not a bug in my benchmark — it's a known limitation. vLLM RFC #15254 ("Better support for weight updating while waking up from sleep mode for RLHF") documents exactly this. The verl project reports the same thing (verl#302). The root cause is that Megatron's subprocess stays alive during wake, consuming the 53 GB that vLLM can't reclaim.
This directly causes a 29% throughput drop: 784 tok/s at step 1 → ~555 tok/s at step 2 onward.
Since SGLang runs as a separate process, /release_memory_occupation frees GPU memory at the OS/driver level — not just within a shared CUDA context. Megatron gets the full GPU during training. When SGLang re-allocates after training, it sees the actual free memory and allocates accordingly. Result: stable 133–135 GB across all 10 steps.
The 3.9x throughput gap is not universal — it's amplified by the Mixture-of-Experts architecture. I cross-referenced with published benchmarks:
| Source | Model | SGLang Speedup |
|---|---|---|
| vLLM Issue #18136 | Qwen3-32B-AWQ (MoE, 4xA10G) | 4.2x |
| LMSYS Benchmark | Llama-70B (dense) | 3.1x |
| Tim Wang Blog (H100) | Llama-3.1-8B (dense, 1 GPU) | ~1.0x |
| RTX 4090 Benchmark | Llama-3.2-3B (dense) | ~2x |
| My result | Qwen3-30B-A3B (MoE, TP=2) | 3.9x |
My 3.9x on Qwen3-30B-A3B aligns with the 4.2x on Qwen3-32B-AWQ. On dense single-GPU models, the gap disappears. This confirms it's an MoE + multi-GPU architectural advantage, not an artifact of my benchmark setup.
RadixAttention: During each rollout, 32 concurrent requests share a system prompt prefix. SGLang deduplicates the KV computation automatically — 1 request computes it, the rest reuse it.
LoRA hot-reload: My weight sync loads a ~2 MB adapter via HTTP in <2s. ART's old path built a 60 GB merged model directory, taking 464s per step. I followed ART's own recommended weight_sync_method="lora" for this.
Setup: Qwen3-30B-A3B-Instruct-2507, GSM8K dataset (1,319 real questions downloaded from OpenAI's repo), 64 requests per step, TP=2, 4xA100, 10 RL training steps.
| Metric | ART's vLLM | My SGLang | Delta |
|---|---|---|---|
| Total time | 1,553s | 1,210s | -22% |
| Server startup | 182s | 53s | 3.4x faster |
| Avg throughput | 582 tok/s | 2,271 tok/s | 3.9x faster |
| Avg ITL | 31.9 ms | 13.9 ms | 2.3x faster |
| Avg p99 latency | 29.5s | 14.1s | -52% |
| Avg GPU memory | 143.3 GB | 133.4 GB | -7% |
| Peak GPU memory | 190.4 GB | 135.2 GB | -29% |
| Total errors | 0 | 0 | tie |
tok/s
2800 ┤
│
2400 ┤ ●───●───●───●───●───●───●───●───● ← My SGLang (~2,430 avg)
│
2000 ┤
│ 3.9x gap
1600 ┤
│
1200 ┤
│
800 ┤●
│ ╲
550 ┤ ■───■───■───■───■───■───■───■───■ ← ART's vLLM (~560 avg)
│
0 ┼──┬───┬───┬───┬───┬───┬───┬───┬───┬
1 2 3 4 5 6 7 8 9 10 Step
ART's vLLM drops from 784 to 555 tok/s after step 1 (29% degradation). My SGLang backend ramps to ~2,400 tok/s by step 2 and stays there.
GB
200 ┤■ ← vLLM peak (190.4 GB)
│ ╲ 53 GB lost (never recovered)
180 ┤ ╲
│ ╲
160 ┤ ╲
│ ╲
140 ┤ ■───■───■───■───■───■───■───■───■ ← ART's vLLM (~137 GB)
│ ●───●───●───●───●───●───●───●───● ← My SGLang (~134 GB)
120 ┤●
│
100 ┼──┬───┬───┬───┬───┬───┬───┬───┬───┬
1 2 3 4 5 6 7 8 9 10 Step
| Metric | Matches published data? | Confidence |
|---|---|---|
| Throughput 3.9x | Yes — MoE-specific (cf. 4.2x on Qwen3-32B) | Medium-High |
| Startup 3.4x | Yes — CUDA graph compilation difference | High |
| ITL 2.3x | Yes — MoE models specifically | Medium-High |
| p99 -52% | Consistent with memory degradation (RFC #15254) | Medium |
| Peak GPU -29% | Well-documented in SGLang benchmarks | High |
| Wall time -22% | Composite metric, expected from above | High |
# SGLang
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
--sglang-python ~/.venvs/sglang-bench/bin/python \
--tp 2 --num-steps 10 --num-rollouts 64 --backends sglang --dataset gsm8k
# vLLM
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
--sglang-python ~/.venvs/sglang-bench/bin/python \
--tp 2 --num-steps 10 --num-rollouts 64 --backends vllm --dataset gsm8kGSM8K test set (1,319 questions) is downloaded automatically on first run and cached locally.
Both backends use the exact same Megatron subprocess for training — same code, same data, same optimizer. I only changed the inference engine and how it manages GPU memory between rollout and training phases.
- ART (OpenPipe) — The codebase I studied and built on top of
- verl (Volcano Engine) — Primary reference for the SGLang integration pattern
- SGLang — Inference engine
- vLLM — ART's default inference engine