SGLang + Megatron: A verl-Style Hybrid Engine for ART's RL Training Pipeline

Result: SGLang wins decisively — 3.9x throughput, 2.3x faster ITL, 52% less tail latency, 29% less peak GPU memory, 3.4x faster startup. Zero errors on both sides.

What I Studied

I went through the ART (Agent Reinforcement Trainer) codebase by OpenPipe to understand how it handles the inference-training lifecycle for reinforcement learning. ART uses vLLM as its inference engine and Megatron for gradient computation, switching between them via a sleep/wake mechanism.

After reading through the key files — src/art/megatron/backend.py, src/art/megatron/service.py, and src/art/unsloth/service.py — I identified the following core flow:

Rollout: vLLM generates completions for a batch of prompts (via in-process AsyncLLM Python API).
Sleep: do_sleep(level=2) releases vLLM's KV cache and model weights from GPU using CUDA VMM unmap_and_release().
Train: Megatron subprocess runs a training step (LoRA + optimizer), producing updated adapter weights.
Wake: do_wake_up() restores weights and KV cache back to GPU via create_and_map().
Weight sync: Updated LoRA adapter is loaded via in-process add_lora().

The critical observation: vLLM and Megatron share the same CUDA context because vLLM runs in-process. When vLLM wakes up, Megatron is still alive and holding GPU memory — so vLLM gets a smaller KV cache pool than it originally had. This is a known issue (vLLM RFC #15254, 17 thumbs-up from core devs).

What I Built

I built an alternative backend that replaces vLLM with SGLang, following the verl (Volcano Engine RL) integration pattern. The key design decision: run SGLang as a separate process with its own CUDA context, communicating via HTTP instead of in-process Python calls.

The implementation consists of three main files:

File	Purpose
`sglang_server.py`	Server lifecycle management (start once, sleep/wake via HTTP, LoRA hot-reload)
`sglang_megatron_service.py`	The sleep → train → wake → load_lora lifecycle
`sglang_megatron_backend.py`	Backend class that inherits from ART's `LocalBackend`

The lifecycle I implemented:

Rollout: SGLang generates completions via OpenAI-compatible HTTP API.
Sleep: HTTP /release_memory_occupation — SGLang frees GPU memory at the OS/driver level. Since it's a separate process, this is a clean release.
Train: Same Megatron subprocess as ART — identical code, identical optimizer. Now it gets the full GPU because SGLang truly freed its memory.
Wake: HTTP /resume_memory_occupation — SGLang re-allocates based on what's actually free.
Weight sync: HTTP /load_lora_adapter loads the ~2 MB LoRA adapter in <2s. I also wrote the TP-shard merging logic (_merge_lora_shards) to correctly handle column-parallel vs row-parallel layers.

I took inspiration from the verl project for the server-starts-once, sleep/wake via HTTP, LoRA hot-reload pattern.

Architecture Comparison

┌─────────────────────────────────────┐    ┌─────────────────────────────────────┐
│  ART's vLLM (what I studied)        │    │  My SGLang Backend                  │
│                                     │    │                                     │
│  ┌─ Single Process ───────────────┐ │    │  ┌─ Process 1 ──────────────────┐  │
│  │  Shared CUDA Context           │ │    │  │  Independent CUDA Context    │  │
│  │                                │ │    │  │                              │  │
│  │  ┌──────────────────────────┐  │ │    │  │  ┌────────────────────────┐  │  │
│  │  │  vLLM AsyncLLM Engine   │  │ │    │  │  │  SGLang Server         │  │  │
│  │  │  (in-process Python API) │  │ │    │  │  │  (HTTP API, persistent)│  │  │
│  │  └──────────────────────────┘  │ │    │  │  └────────────────────────┘  │  │
│  │  ┌────────────┐ ┌───────────┐  │ │    │  │  ┌────────────┐ ┌────────┐  │  │
│  │  │  KV Cache  │ │  Weights  │  │ │    │  │  │ RadixAttn  │ │  LoRA  │  │  │
│  │  │ CUDA VMM   │ │  GPU      │  │ │    │  │  │ KV Cache   │ │ <2s    │  │  │
│  │  └────────────┘ └───────────┘  │ │    │  │  └────────────┘ └────────┘  │  │
│  │                                │ │    │  └──────────────────────────────┘  │
│  │  ┌──────────────────────────┐  │ │    │                                     │
│  │  │  Megatron Subprocess     │  │ │    │  ┌─ Process 2 ──────────────────┐  │
│  │  │  (stays alive, holds GPU)│  │ │    │  │  Megatron Training           │  │
│  │  │  LoRA + Optimizer States │  │ │    │  │  (gets full GPU after sleep) │  │
│  │  └──────────────────────────┘  │ │    │  │  LoRA + Optimizer States     │  │
│  └────────────────────────────────┘ │    │  └──────────────────────────────┘  │
│                                     │    │                                     │
│  ▲ 53 GB lost after 1st cycle       │    │  ✓ Full memory recovery every step  │
└─────────────────────────────────────┘    └─────────────────────────────────────┘

Aspect	ART's vLLM	My SGLang Backend
Process model	In-process	Separate process
Sleep/wake	`do_sleep(level=2)`	HTTP `/release_memory`
Memory recovery	53 GB lost permanently	Full recovery each step
Weight sync	In-process `add_lora()`	HTTP LoRA hot-reload (<2s)
KV cache	Standard prefix cache	RadixAttention
Startup	~182s	~53s (3.4x faster)
Training engine	Megatron (identical — same code on both)	Megatron (identical — same code on both)

Why the Results Are What They Are

The Memory Problem I Found in ART

After the first sleep/wake cycle, ART's vLLM loses ~53 GB of GPU memory and never gets it back. I verified this by monitoring nvidia-smi across steps:

Step 1: 190.4 GB  →  Step 2: 136.6 GB  →  Step 10: 139.1 GB

This is not a bug in my benchmark — it's a known limitation. vLLM RFC #15254 ("Better support for weight updating while waking up from sleep mode for RLHF") documents exactly this. The verl project reports the same thing (verl#302). The root cause is that Megatron's subprocess stays alive during wake, consuming the 53 GB that vLLM can't reclaim.

This directly causes a 29% throughput drop: 784 tok/s at step 1 → ~555 tok/s at step 2 onward.

Why My Architecture Avoids It

Since SGLang runs as a separate process, /release_memory_occupation frees GPU memory at the OS/driver level — not just within a shared CUDA context. Megatron gets the full GPU during training. When SGLang re-allocates after training, it sees the actual free memory and allocates accordingly. Result: stable 133–135 GB across all 10 steps.

The MoE Factor

The 3.9x throughput gap is not universal — it's amplified by the Mixture-of-Experts architecture. I cross-referenced with published benchmarks:

Source	Model	SGLang Speedup
vLLM Issue #18136	Qwen3-32B-AWQ (MoE, 4xA10G)	4.2x
LMSYS Benchmark	Llama-70B (dense)	3.1x
Tim Wang Blog (H100)	Llama-3.1-8B (dense, 1 GPU)	~1.0x
RTX 4090 Benchmark	Llama-3.2-3B (dense)	~2x
My result	Qwen3-30B-A3B (MoE, TP=2)	3.9x

My 3.9x on Qwen3-30B-A3B aligns with the 4.2x on Qwen3-32B-AWQ. On dense single-GPU models, the gap disappears. This confirms it's an MoE + multi-GPU architectural advantage, not an artifact of my benchmark setup.

Other Contributing Factors

RadixAttention: During each rollout, 32 concurrent requests share a system prompt prefix. SGLang deduplicates the KV computation automatically — 1 request computes it, the rest reuse it.

LoRA hot-reload: My weight sync loads a ~2 MB adapter via HTTP in <2s. ART's old path built a 60 GB merged model directory, taking 464s per step. I followed ART's own recommended weight_sync_method="lora" for this.

Benchmark Results

Setup: Qwen3-30B-A3B-Instruct-2507, GSM8K dataset (1,319 real questions downloaded from OpenAI's repo), 64 requests per step, TP=2, 4xA100, 10 RL training steps.

Summary

Metric	ART's vLLM	My SGLang	Delta
Total time	1,553s	1,210s	-22%
Server startup	182s	53s	3.4x faster
Avg throughput	582 tok/s	2,271 tok/s	3.9x faster
Avg ITL	31.9 ms	13.9 ms	2.3x faster
Avg p99 latency	29.5s	14.1s	-52%
Avg GPU memory	143.3 GB	133.4 GB	-7%
Peak GPU memory	190.4 GB	135.2 GB	-29%
Total errors	0	0	tie

Throughput per RL Step

tok/s
2800 ┤
     │
2400 ┤  ●───●───●───●───●───●───●───●───●   ← My SGLang (~2,430 avg)
     │
2000 ┤
     │                                         3.9x gap
1600 ┤
     │
1200 ┤
     │
 800 ┤●
     │ ╲
 550 ┤  ■───■───■───■───■───■───■───■───■   ← ART's vLLM (~560 avg)
     │
   0 ┼──┬───┬───┬───┬───┬───┬───┬───┬───┬
     1   2   3   4   5   6   7   8   9  10   Step

ART's vLLM drops from 784 to 555 tok/s after step 1 (29% degradation). My SGLang backend ramps to ~2,400 tok/s by step 2 and stays there.

GPU Memory Usage

 GB
200 ┤■                                        ← vLLM peak (190.4 GB)
    │ ╲  53 GB lost (never recovered)
180 ┤  ╲
    │   ╲
160 ┤    ╲
    │     ╲
140 ┤      ■───■───■───■───■───■───■───■───■  ← ART's vLLM (~137 GB)
    │   ●───●───●───●───●───●───●───●───●     ← My SGLang (~134 GB)
120 ┤●
    │
100 ┼──┬───┬───┬───┬───┬───┬───┬───┬───┬
    1   2   3   4   5   6   7   8   9  10    Step

Confidence Assessment

Metric	Matches published data?	Confidence
Throughput 3.9x	Yes — MoE-specific (cf. 4.2x on Qwen3-32B)	Medium-High
Startup 3.4x	Yes — CUDA graph compilation difference	High
ITL 2.3x	Yes — MoE models specifically	Medium-High
p99 -52%	Consistent with memory degradation (RFC #15254)	Medium
Peak GPU -29%	Well-documented in SGLang benchmarks	High
Wall time -22%	Composite metric, expected from above	High

Running the Benchmark

# SGLang
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
    --sglang-python ~/.venvs/sglang-bench/bin/python \
    --tp 2 --num-steps 10 --num-rollouts 64 --backends sglang --dataset gsm8k

# vLLM
CUDA_VISIBLE_DEVICES=0,1,2,3 uv run python benchmarks/sglang_vs_vllm/run_benchmark.py \
    --sglang-python ~/.venvs/sglang-bench/bin/python \
    --tp 2 --num-steps 10 --num-rollouts 64 --backends vllm --dataset gsm8k

GSM8K test set (1,319 questions) is downloaded automatically on first run and cached locally.

Credits

Both backends use the exact same Megatron subprocess for training — same code, same data, same optimizer. I only changed the inference engine and how it manages GPU memory between rollout and training phases.

ART (OpenPipe) — The codebase I studied and built on top of
verl (Volcano Engine) — Primary reference for the SGLang integration pattern
SGLang — Inference engine
vLLM — ART's default inference engine

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SGLang + Megatron: A verl-Style Hybrid Engine for ART's RL Training Pipeline

What I Studied

What I Built

Architecture Comparison

Why the Results Are What They Are

The Memory Problem I Found in ART

Why My Architecture Avoids It

The MoE Factor

Other Contributing Factors

Benchmark Results

Summary

Throughput per RL Step

GPU Memory Usage

Confidence Assessment

Running the Benchmark

Credits

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

SGLang + Megatron: A verl-Style Hybrid Engine for ART's RL Training Pipeline

What I Studied

What I Built

Architecture Comparison

Why the Results Are What They Are

The Memory Problem I Found in ART

Why My Architecture Avoids It

The MoE Factor

Other Contributing Factors

Benchmark Results

Summary

Throughput per RL Step

GPU Memory Usage

Confidence Assessment

Running the Benchmark

Credits