Configuration: gpt-oss 120B · 1K / 1K · FP4 · Metric: Token Throughput per GPU · Curves: MI355X (SGLang), B200 (SGLang)
OpenAI gpt-oss 120B · 120B total · 5B active · 131K context

Token Embedding · d = 2,880 · vocab = 201,088
  ↓
Sliding Attention + Sink · ×18 layers · Top-4/128 MoE
  ↕ alternating every layer
Causal Grouped Query Attention · ×18 layers · Top-4/128 MoE
  ↓
Output Head (LM Head) · vocab = 201,088

Features: Alternating Sliding/Full Attention · Attention Sink Tokens · YaRN RoPE (factor=32) · MXFP4 Quantization

Released by OpenAI on Aug 5, 2025
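The alternating sliding/full attention pattern and the sink tokens listed above can be summarized in a few lines. The sketch below is illustrative only: the 128-token window and the even/odd layer assignment are assumptions, and gpt-oss implements sinks as learned per-head logits, which are approximated here as an always-visible position 0.

```python
# Illustrative sketch of the layer pattern above (not the reference implementation).
import torch

NUM_LAYERS = 36          # 18 sliding + 18 full attention, alternating every layer
SLIDING_WINDOW = 128     # assumed window size for the sliding-attention layers

def attention_mask(layer_idx: int, seq_len: int) -> torch.Tensor:
    q_pos = torch.arange(seq_len).unsqueeze(1)
    k_pos = torch.arange(seq_len).unsqueeze(0)
    mask = k_pos <= q_pos                                  # causal
    if layer_idx % 2 == 0:                                 # assumed: even layers use sliding attention
        mask = mask & ((q_pos - k_pos) < SLIDING_WINDOW)   # recent window only...
        mask[:, 0] = True                                  # ...plus the sink position (simplification)
    return mask                                            # [seq_len, seq_len] boolean mask
```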
Latest run: Mar 04, 2026 · Run 1/1
Metric views under construction; data will be available soon. Curve legend: B200 (vLLM) · MI355X (vLLM)
⚠ Kernel comparison does not include inter-kernel bubbles or the latency saved by parallel kernel execution.

Kernel categories: GEMM/MOE · ATTN · AR/NORM · QUANT · TOPK · ACT · CACHE · other
Status & Hardware legend: MI355X (Before) · MI355X (Roofline) · MI355X (After Opt) · B200 (Reference)
Summary metrics (awaiting data): Input Throughput/GPU ×avg · Output Throughput/GPU ×avg · TTFT ×avg · ITL ×avg · Total Latency ×avg · Gap vs B200 · Gap vs Roofline
Optimization Plan:
P1 ☑ MoE FP4 GEMM Opt. · BW 59% · 756 ops · -171.5ms · 24.0% E2E (714.8 → 543.3ms)
  ☑ fp4_moe_gemm_kernel · BW: 2487 GB/s · Compute: 1200 TFLOPS
  ☑ fp4_dequant_gemm · BW: 2240 GB/s · Compute: 2100 TFLOPS
  ☐ moe_gate_topk · BW: 3200 GB/s · Compute: 2800 TFLOPS (routing sketched after this plan)
P2 ☑ MLA Flash Attention Opt. · BW 53% · latent projection overhead · -26.0ms · 3.6% E2E (543.3 → 517.3ms)
  ☑ flash_attn_fwd_v2 · BW: 2800 GB/s · Compute: 2600 TFLOPS
  ☐ mla_absorb_proj · BW: 2480 GB/s · Compute: 2200 TFLOPS
P3 ☐ MLA QKV Absorb Projection · suboptimal N=512 tiling · 244 ops · -15.8ms · 2.2% E2E (517.3 → 501.5ms)
  ☐ absorb_attn_qkv · BW: 2240 GB/s · Compute: 1960 TFLOPS
  ☐ absorb_output_proj · BW: 4100 GB/s · Compute: 2100 TFLOPS
P4 ☐ SwiGLU + Elementwise Fusion · unfused silu + mul · 122 ops · -10.2ms · 1.4% E2E (501.5 → 491.3ms)
  ☐ silu_activation · BW: 3840 GB/s · Compute: 180 TFLOPS
  ☐ elementwise_mul · BW: 4480 GB/s · Compute: 160 TFLOPS
P5 ☐ RMSNorm Triton Fusion · 8-kernel decomposition · 984 ops · -4.7ms · 0.7% E2E (491.3 → 486.6ms)
  ☐ rms_norm_triton · BW: 3840 GB/s · Compute: 312 TFLOPS
  ☐ residual_add_kernel · BW: 3600 GB/s · Compute: 280 TFLOPS
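For reference, moe_gate_topk in P1 above corresponds to the routing step of the Top-4/128 MoE. The sketch below is a plain PyTorch stand-in; the tensor names and the softmax-after-top-k placement are assumptions for illustration, not the kernel's actual interface.

```python
# Plain PyTorch stand-in for the routing that moe_gate_topk performs on-device.
import torch

def moe_gate_topk(hidden: torch.Tensor, router_weight: torch.Tensor, top_k: int = 4):
    """hidden: [tokens, d_model]; router_weight: [d_model, 128] for 128 experts."""
    logits = hidden @ router_weight                              # [tokens, 128] router scores
    gate_vals, expert_ids = torch.topk(logits, top_k, dim=-1)    # keep the 4 best experts per token
    gate_vals = torch.softmax(gate_vals, dim=-1)                 # renormalize over the selected experts
    return gate_vals, expert_ids   # used to dispatch tokens and scale expert outputs
```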
fp4_moe_gemm_kernel · MoE GEMM Opt. · 2m 34s · Validated ✅
  BW: 31% (pre-opt) · Est. -171.5ms · Speedup: 2.4× · E2E: 24.0% ▲
  Strategy Details (Est. 2.4×): Optimize BW utilization 31% → 75% · Memory layout + tiling
    Strategy: Fuse memory layout + loop tiling + vectorized load
    Target: BW util 31% → 75%, reduce memory round-trips by 60%
fp4_dequant_gemm · MoE GEMM Opt. · 1m 12s · Processing 🔄
  Fuse dequant + GEMM · Est. -26.0ms · Iter 3/8
  Strategy Details (Est. 1.8×): Fuse dequantize + GEMM · Reduce memory round-trips
    Strategy: Kernel fusion (dequant → GEMM) + shared memory staging
    Target: Eliminate intermediate buffer write, reduce BW 40%
flash_attn_fwd_v2 · MLA Attn Opt. · 0m 45s · Sent →
  MLA latent proj + flash attn · Est. -26.0ms
  Strategy Details (Est. 1.35×): Optimize MLA compressed KV path · Reduce latent projection overhead
    Strategy: Fuse absorb_attn + flash_attn_fwd · Swizzled KV layout for d_c=512
    Target: BW util 53% → 72%, eliminate 122 absorb_proj kernel launches
Est. Total Saving: -228.2ms (31.9%) · 714.8 → 486.6ms
Pre-optimization (Baseline): Bandwidth 2487 GB/s · Compute 1200 TFLOPS · BW Util 31.0% · Compute Util 12.0% · Bottleneck: 🔴 Mem Bound
Post-optimization (GEAK Result): Bandwidth 5134 GB/s · Compute 2487 TFLOPS · BW Util 64.2% (+33.2%) · Compute Util 24.7% (+12.7%) · Speedup 2.4×
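The utilization figures on the two cards above are simply achieved bandwidth and compute divided by the device peaks; the sketch below reproduces them. The peak numbers are assumptions (roughly MI355X-class: ~8 TB/s HBM, ~10 PFLOPS dense low-precision), chosen because they are consistent with the percentages shown, not vendor-confirmed values.

```python
# Roofline bookkeeping behind the Baseline / GEAK cards (peak values are assumptions).
PEAK_BW_GBPS = 8000.0          # assumed HBM peak bandwidth
PEAK_COMPUTE_TFLOPS = 10000.0  # assumed dense peak for this dtype

def roofline(achieved_bw_gbps, achieved_tflops):
    bw_util = achieved_bw_gbps / PEAK_BW_GBPS
    compute_util = achieved_tflops / PEAK_COMPUTE_TFLOPS
    bottleneck = "memory-bound" if bw_util > compute_util else "compute-bound"
    return bw_util, compute_util, bottleneck

print(roofline(2487, 1200))   # baseline:  (~0.31, 0.12, 'memory-bound')
print(roofline(5134, 2487))   # post-opt:  (~0.64, ~0.25, 'memory-bound')
```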
INPUT THROUGHPUT/GPU: 52.27 tok/s/GPU (was 50.65, ▲ 3.2%)
OUTPUT THROUGHPUT/GPU: 51.76 tok/s/GPU (was 50.16, ▲ 3.2%)
TTFT (Time to First Token): 88.15 ms (was 90.69 ms, ▼ 2.8%)
ITL (Inter-Token Latency): 9.23 ms (was 9.53 ms, ▼ 3.1%)
TOTAL E2E UPLIFT: +3.2% (Dtype +1.7% | MoE fused +1.5%)
GAP vs B200 (conc=4): 22% → ~19% (▼ 3pp)
⚠️ MoE Sort: sort kernel shows a 1.6× uplift but the MoE kernel is 3× slower; under investigation
⚠️ BS4 MLA: integrates correctly but shows a 2.2× slowdown versus the UT speedup; investigating
❌ Meta OOB: Fused MoE not showing a perf improvement
Model: Claude Sonnet 4.5 (options: Claude Sonnet 4.5 · Claude Opus 4.5 · GPT-5 · GPT-5.1 · GPT-5 Codex · Gemini 3 Pro)
Exposure configuration: gpt-oss 120B · FP4 · 1K / 1K · MI355X (SGLang) · Compare: None
No items exposed yet; select parameters above and click Expose.
INPUT THROUGHPUT/GPU: 52.27 tok/s/GPU (was 50.65, ▲ 3.2%)
OUTPUT THROUGHPUT/GPU: 51.76 tok/s/GPU (was 50.16, ▲ 3.2%)
TTFT (Time to First Token): 88.15 ms (was 90.69 ms, ▼ 2.8%)
ITL (Inter-Token Latency): 9.23 ms (was 9.53 ms, ▼ 3.1%)
TOTAL E2E UPLIFT: +3.2% (Dtype +1.7% | MoE fused +1.5%)
Gap vs B200 (conc=4, TP=8): 22% → ~19% (▼ 3pp)
Workload breakdown: MoE 31% · Fused SGLang MoE 20.9% · MLA 13% · Attn GEMMs 13%
fp4_moe_gemm_kernel
  Throughput: 229 TFLOPS → 377 TFLOPS (+66%)
  Latency: 9.44 ms → 5.75 ms (1.64× faster)
  vs Roofline: BW Util 31% → 64.2% (+33.2%) · Compute Util 12% → 24.7% (+12.7%) · Speedup 2.4×
fp4_dequant_gemm
  Optimization: Fused dequantize + GEMM · Shared memory staging
  E2E Uplift: Eliminated intermediate buffer writes (+1.5%)
  vs Roofline: BW Util 28% → 58% (+30%) · Reduced BW pressure 40%
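To make the fusion concrete, here is a hedged Triton sketch of the dequant-inside-GEMM pattern this card describes: weight tiles are dequantized in registers inside the K-loop instead of being written back to HBM first. For readability it uses int8 codes with one fp32 scale per K-block per column as a stand-in for the real MXFP4 packing, so block sizes, layouts, and names are assumptions rather than the production kernel.

```python
# Hedged Triton sketch: dequantization fused into the GEMM K-loop, with no intermediate
# dequantized weight buffer in HBM. int8 codes + per-K-block scales stand in for MXFP4.
import triton
import triton.language as tl

@triton.jit
def dequant_gemm_kernel(x_ptr, wq_ptr, scale_ptr, out_ptr, M, N, K,
                        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)

    for kb in range(0, tl.cdiv(K, BLOCK_K)):
        offs_k = kb * BLOCK_K + tl.arange(0, BLOCK_K)
        # Activation tile [BLOCK_M, BLOCK_K] from row-major x[M, K]
        x = tl.load(x_ptr + offs_m[:, None] * K + offs_k[None, :],
                    mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
        # Quantized weight tile [BLOCK_K, BLOCK_N] of int8 codes from row-major wq[K, N]
        wq = tl.load(wq_ptr + offs_k[:, None] * N + offs_n[None, :],
                     mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0)
        # One scale per (K-block, column): dequantize in registers, no HBM round-trip
        scale = tl.load(scale_ptr + kb * N + offs_n, mask=offs_n < N, other=0.0)
        acc += tl.dot(x, wq.to(tl.float32) * scale[None, :])

    tl.store(out_ptr + offs_m[:, None] * N + offs_n[None, :], acc,
             mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))
```

The grid would be (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N)); the point of the sketch is that the dequantized tile only ever lives in registers/LDS, which is where the "eliminated intermediate buffer writes" saving comes from.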
flash_attn_fwd_v2 (MLA)
  Optimization: Fuse absorb_attn + flash_attn · Swizzled KV layout (d_c=512)
  BW Improvement: 53% → 72% of HBM peak (-26.0ms)
  vs Roofline: Eliminates 122 absorb_proj kernel launches · Reduces latent projection overhead · E2E: 3.6%
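The "absorb" step this card refers to can be illustrated with plain einsums: fold the latent-to-key up-projection into the query once, so attention scores are computed directly against the compressed d_c = 512 cache instead of re-expanding it per token. The shapes, tensor names, and head count below are assumptions for illustration, not the model's actual configuration.

```python
# Hedged sketch of weight absorption for the compressed-KV (MLA) path.
import torch

d_c, n_heads, d_head = 512, 64, 128
w_uk = torch.randn(n_heads, d_head, d_c)      # latent -> per-head key up-projection
q = torch.randn(1, n_heads, d_head)           # query after its own projection
kv_latent = torch.randn(4096, d_c)            # compressed KV cache, one vector per token

# Naive path: expand keys to full head dim for every cached token, then score
k_full = torch.einsum("hdc,tc->thd", w_uk, kv_latent)
scores_naive = torch.einsum("bhd,thd->bht", q, k_full)

# Absorbed path: fold w_uk into the query once, score directly against the latent cache
q_absorbed = torch.einsum("bhd,hdc->bhc", q, w_uk)
scores_absorbed = torch.einsum("bhc,tc->bht", q_absorbed, kv_latent)

assert torch.allclose(scores_naive, scores_absorbed, rtol=1e-3, atol=1e-2)
```

As the card notes, the saving comes from removing the separate per-token expansion (the absorb_proj launches), not from changing the math.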
SwiGLU + Elementwise Fusion
  Optimization: Fuse silu + mul into a single kernel per expert per layer
  Saving: Eliminates one HBM round-trip per expert (-10.2ms)
  vs Roofline: BW Util 52% pre-opt · 122 ops across 61 layers · E2E: 1.4%
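A minimal Triton sketch of that fusion: one kernel reads the gate and up projections once and writes silu(gate) * up once, instead of a separate silu kernel plus an elementwise multiply (two extra HBM round-trips). Function and tensor names are illustrative, not the SGLang/vLLM kernels.

```python
# Hedged sketch of the fused SwiGLU elementwise kernel (illustrative, not production code).
import torch
import triton
import triton.language as tl

@triton.jit
def fused_swiglu_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    gate = tl.load(gate_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    up = tl.load(up_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + offs, gate * tl.sigmoid(gate) * up, mask=mask)  # one read pair, one write

def fused_swiglu(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(gate, dtype=torch.float32)
    n = gate.numel()
    fused_swiglu_kernel[(triton.cdiv(n, 1024),)](gate, up, out, n, BLOCK=1024)
    return out
```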
RMSNorm Triton Fusion
  Optimization: Fuse the 8-kernel decomposition (cast→pow→mean→…) into a single Triton kernel
  Saving: 123 instances × 8 kernels → 123 fused ops (-4.7ms)
  vs Roofline: BW Util 93% pre-opt (near BW ceiling) · Single-pass read+normalize+write · E2E: 0.7%
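For comparison, a single-pass Triton RMSNorm of the kind this card describes looks roughly like the sketch below: one program per row, one read and one write, with the mean-square computed in registers. It assumes the hidden size fits in one block (BLOCK must be a power of two, e.g. 4096 for d = 2,880) and omits the fused residual add; it is a sketch, not the kernel that produced the numbers above.

```python
# Hedged single-pass RMSNorm sketch (replaces the cast/pow/mean/... kernel chain).
import triton
import triton.language as tl

@triton.jit
def rms_norm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)                       # one program per token row
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    x = tl.load(x_ptr + row * n_cols + offs, mask=mask, other=0.0).to(tl.float32)
    inv = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)   # mean-square in registers
    w = tl.load(w_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + offs, x * inv * w, mask=mask)   # single write

# launch: rms_norm_kernel[(num_rows,)](x, weight, out, n_cols, 1e-6, BLOCK=4096)
```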
SemiAnalysis: Dtype change (non-AI optimized) +1.7% uplift | MoE fused +1.5% uplift
⚠️ MoE Sort: sort kernel shows a 1.6× uplift but the MoE kernel is 3× slower; regression under investigation
⚠️ BS4 MLA: integrates correctly but shows a 2.2× slowdown versus the UT speedup; investigating kernel behavior differences
Meta OOB: Fused MoE not showing a perf improvement
INPUT THROUGHPUT/GPU: 72.92 tok/s/GPU (was 67.52, ▲ 8.0%)
OUTPUT THROUGHPUT/GPU: 71.56 tok/s/GPU (was 66.26, ▲ 8.0%)
TTFT (Time to First Token): 62.6 ms (was 67.6 ms, ▼ 7.4%)
ITL (Inter-Token Latency): 7.37 ms (was 7.96 ms, ▼ 7.4%)
TOTAL E2E UPLIFT: +8.0% (Attn FP8 +15% | 62.7% WL covered)
Gap vs B200 (conc=4, TP=8): 32% → ~27% (▼ 5pp)
Attention FP8 (32.7% of workload)
  Optimization: FP8 delivers a 10–15% uplift over BF16
  Root Cause: Identified the FP8 kernel memory-fault root cause
  Notes: Triton already bandwidth-bound · Minimal further optimization possible · Delivered for integration
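As a rough illustration of the FP8 path, the sketch below does per-tensor dynamic scaling into float8_e4m3fn before the attention GEMMs, which is where the bandwidth saving over BF16 comes from. The scaling scheme and names are assumptions; the production kernel folds the scales into the attention computation rather than dequantizing like this.

```python
# Hedged per-tensor FP8 (e4m3) quantization sketch for the attention inputs.
import torch

FP8_MAX = 448.0   # max finite magnitude of float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    scale = x.abs().amax().clamp(min=1e-12) / FP8_MAX     # dynamic per-tensor scale
    return (x / scale).to(torch.float8_e4m3fn), scale     # half the bytes of BF16

q = torch.randn(8, 64, 128, dtype=torch.bfloat16)
q_fp8, q_scale = quantize_fp8(q)
q_ref = q_fp8.to(torch.bfloat16) * q_scale                # dequantized reference check
```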
MLPerf: ~8% end-to-end uplift in Offline and Server modes
Meta OOB: ~2-2.5% overall E2E geomean improvement
SemiAnalysis: In Progress · ETA: 3/10

Per Model: each model → separate file | CSV: raw data export