mlcommons · hvagadia · Jun 2, 2026 · May 6, 2026 · May 6, 2026 · May 6, 2026
@@ -1,297 +1,93 @@
-# Multi-Turn Conversation Benchmarking Examples
+# Multi-Turn Agentic Benchmark
 
-This directory contains examples for benchmarking conversational AI workloads with multi-turn conversation support.
+This example runs multi-turn agentic conversations through an OpenAI-compatible
+endpoint. The client preserves turn order, keeps at most one turn in flight per
+active conversation, and adds an `X-Session-ID: <conversation_id>` header to
+every request so a router can keep a conversation on the same backend.
 
-## Overview
+## Dataset
 
-Multi-turn conversation benchmarking enables testing realistic conversational AI scenarios where each turn depends on previous responses. The system maintains conversation history and enforces turn sequencing to simulate real-world multi-turn interactions.
-
-## Dataset Format
-
-Multi-turn datasets use JSONL format with the following structure:
+Use flat JSONL with one row per message. Rows for each `conversation_id` must be
+contiguous and ordered by increasing `turn`.
 
 ```jsonl
-{"conversation_id": "c1", "turn": 1, "role": "user", "content": "...", "system": "..."}
-{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "..."}
-{"conversation_id": "c1", "turn": 3, "role": "user", "content": "..."}
+{"conversation_id":"c1","turn":1,"role":"user","system":"...","content":"...","tools":[...],"delay_seconds":0.4}
+{"conversation_id":"c1","turn":2,"role":"assistant","tool_calls":[...]}
+{"conversation_id":"c1","turn":3,"role":"tool","tool_results":[...],"delay_seconds":1.2}
+{"conversation_id":"c1","turn":4,"role":"assistant","content":"..."}
 ```
 
-### Required Fields
-
-- `conversation_id`: Unique identifier for each conversation
-- `turn`: Turn number within conversation (1-indexed)
-- `role`: Speaker role ("user" or "assistant")
-- `content`: Message content
-
-### Optional Fields
-
-- `system`: System prompt (typically only on first user turn)
-- `model`: Model name override for this turn
-- `max_new_tokens`: Maximum tokens to generate for this turn
-
-### Validation Rules
-
-1. All rows for a given `conversation_id` must appear **consecutively** in the file (no interleaving
-   with rows from other conversations). File-order within a conversation does not matter — the
-   loader sorts by the `turn` column when building conversation history.
-   The flat-row format is intentional: it enables row-by-row streaming without loading entire
-   conversations into memory first.
-2. Conversations must follow a valid role sequence:
-   - Plain chat: `user → assistant → user → ...`
-   - Agentic: `user → assistant (with tool_calls) → tool (tool_results list; parallel results merged) → [assistant (with tool_calls) → tool]* → assistant → user → ...`
-3. First turn must be "user" role
-4. Turn numbers must be sequential (1, 2, 3, ...)
-5. Each conversation must have at least one turn
-
-## Agentic (Tool-Sequence) Datasets
-
-For agentic workloads where the model dispatches tools, the dataset must include tool-call
-metadata. The source format for these datasets is a **snapshot JSONL** — each line contains the
-full conversation history at a particular checkpoint. The benchmarker requires **flat-row JSONL**
-(one row per message), so a conversion step is needed first.
-
-### Source snapshot format
-
-Each line in the source file represents one snapshot of a conversation:
-
-```json
-{
-  "conversation_id": "sim_001",
-  "conversation_idx": 5,
-  "messages": [
-    {"role": "system", "content": "..."},
-    {"role": "user", "content": "..."},
-    {"role": "assistant", "tool_calls": [{"id": "...", "type": "function", "function": {"name": "bash", "arguments": "{\"cmd\": \"ls\"}"}}]},
-    {"role": "tool", "tool_call_id": "...", "content": "file1.txt\nfile2.txt"},
-    {"role": "assistant", "content": "Done."}
-  ],
-  "tools": [...],
-  "metadata": {}
-}
-```
+Required fields are `conversation_id`, `turn`, and `role`. User rows normally
+include `content`; agentic rows can also include `system`, `tools`,
+`tool_calls`, `tool_results`, `reasoning_content`, and `delay_seconds`.
 
-Multiple snapshots may exist per `conversation_id` (one per `conversation_idx`); only the
-highest-indexed snapshot per conversation is used.
+Place the dataset under `examples/09_MultiTurn/datasets/` or point the YAML at
+another accessible JSONL path.
 
-### Converting to flat-row format
+## Start A Server
 
-The following commands convert each source snapshot file to the flat-row format required by the benchmarker.
-Run from the repo root:
+Start an SGLang OpenAI-compatible server. This is the standard recipe used for
+throughput replays; adjust `--model-path`, `--tp`, and `--port` for your node.
 
 ```bash
-# First argument: input snapshot JSONL; second argument: output flat-row JSONL
-python scripts/convert_agentic_snapshot.py \
-    /path/to/agentic_coding_dataset.jsonl \
-    examples/09_MultiTurn/datasets/agentic_coding_flat.jsonl \
-    --verify
-
-python scripts/convert_agentic_snapshot.py \
-    /path/to/agentic_workflow_dataset.jsonl \
-    examples/09_MultiTurn/datasets/agentic_workflow_flat.jsonl \
-    --verify
-```
-
-The `datasets/` directory under `examples/09_MultiTurn/` is a placeholder; run the conversion
-commands above to populate it before benchmarking.
-
-The `--verify` flag cross-checks every client turn's message history against the source snapshot
-and exits with code 1 if any mismatch is found. The script also:
-
-- Collapses consecutive `user` messages into one (keeps turn sequencing clean)
-- Merges consecutive `tool` messages for the same assistant dispatch into a single row with a
-  `tool_results` list (so all parallel results are sent together in one API call)
-
-### Flat-row format after conversion
-
-The extra fields supported beyond plain user/assistant:
-
-| Row role                                   | Extra fields                                                       |
-| ------------------------------------------ | ------------------------------------------------------------------ |
-| `assistant` with tool calls                | `tool_calls: [{id, type, function: {name, arguments}}]`            |
-| `tool` results (single or merged parallel) | `tool_results: [{tool_call_id, content}, ...]`                     |
-| `user` or `tool` turns                     | `tools: [...]` (OpenAI tool definitions forwarded to the endpoint) |
-
-Example rows from a converted agentic dataset:
-
-```jsonl
-{"conversation_id": "sim_001", "turn": 1, "role": "user", "content": "Fix the bug in foo.py", "system": "You are a coding agent.", "tools": [...]}
-{"conversation_id": "sim_001", "turn": 2, "role": "assistant", "tool_calls": [{"id": "functions.bash:0", "type": "function", "function": {"name": "bash", "arguments": "{\"cmd\": \"cat foo.py\"}"}}]}
-{"conversation_id": "sim_001", "turn": 3, "role": "tool", "tool_results": [{"tool_call_id": "functions.bash:0", "content": "def foo():\n    return 1/0"}], "tools": [...]}
-{"conversation_id": "sim_001", "turn": 4, "role": "assistant", "content": "The bug is a ZeroDivisionError. Here is the fix: ..."}
+python3 -m sglang.launch_server \
+  --model-path /path/to/Kimi-K2.6 \
+  --served-model-name kimi-k2.6 \
+  --tp 8 \
+  --trust-remote-code \
+  --reasoning-parser kimi_k2 \
+  --tool-call-parser kimi_k2 \
+  --host 0.0.0.0 \
+  --port 8000
 ```
 
-### Running agentic benchmarks
-
-After converting the datasets, update the `path` field in the config files and run:
+`--model-path` is the checkpoint loaded by the server. It can be a local path
+visible to the server container or a Hugging Face model id, depending on your
+SGLang environment. `--served-model-name` is the OpenAI model name exposed to
+clients; set `model_params.name` in the YAML to the same value.
+
+## Client YAML
+
+The runnable config is
+`examples/09_MultiTurn/kimi_agentic_benchmark.yaml`.
+
+Key fields:
+
+- `type: online`: runs the benchmark through the online scheduler.
+- `model_params.name`: model name sent in each OpenAI request. Keep it aligned
+  with the served model name.
+- `model_params.temperature`, `top_p`, `max_new_tokens`: sampling settings sent
+  to the server. `max_new_tokens` is large because agent turns can be long.
+- `model_params.chat_template_kwargs`: Kimi-specific template options for
+  reasoning preservation.
+- First `datasets` entry `name`: label used in benchmark outputs.
+- First `datasets` entry `type: performance`: multi-turn datasets are replayed as
+  performance datasets.
+- First `datasets` entry `path`: JSONL dataset path to run.
+- First `datasets` entry `multi_turn.turn_timeout_s`: per-turn deadline. A
+  timeout aborts the remaining turns in that conversation.
+- First `datasets` entry `multi_turn.enable_salt`: appends a deterministic cache
+  salt to each conversation system prompt.
+- First `datasets` entry `multi_turn.inject_tool_delay`: waits for each positive
+  `delay_seconds` value on a client turn before issuing that turn.
+- `settings.runtime.min_duration_ms`: minimum run duration. With no max duration
+  override, the run finishes when the dataset is exhausted.
+- `settings.load_pattern.type: multi_turn`: enables conversation-aware issuing.
+- `settings.load_pattern.target_concurrency`: maximum active conversations.
+  Each active conversation has at most one in-flight request.
+- `settings.client.warmup_connections: 0`: avoids stale pre-warmed sockets with
+  servers that close idle connections quickly.
+- `settings.client.max_idle_time`: how long an idle connection is kept alive.
+- `endpoint_config.endpoints`: server URL list.
+- `endpoint_config.api_type: openai`: sends requests to `/v1/chat/completions`.
+- `report_dir`: output directory for events, snapshots, and reports.
+
+## Run The Client
+
+Update the first `datasets` entry (`name` and `path`), `model_params.name`, and
+`endpoint_config.endpoints` as needed, then run:
 
 ```bash
-inference-endpoint benchmark from-config \
-    --config examples/09_MultiTurn/agentic_coding_benchmark.yaml
-
-inference-endpoint benchmark from-config \
-    --config examples/09_MultiTurn/agentic_workflow_benchmark.yaml
-```
-
----
-
-## Configuration
-
-### Basic Configuration
-
-```yaml
-datasets:
-  - name: customer_support
-    type: performance
-    path: examples/09_MultiTurn/customer_support_conversations.jsonl
-    multi_turn:
-      turn_timeout_s: 300.0
-
-settings:
-  load_pattern:
-    type: multi_turn
-    target_concurrency: 32 # ← Required for multi_turn load pattern
-```
-
-### Concurrency Control
-
-The `target_concurrency` field is **required** for the `multi_turn` load pattern and controls the maximum number of conversations active simultaneously (each active conversation has at most one in-flight turn):
-
-```yaml
-settings:
-  load_pattern:
-    type: multi_turn
-    target_concurrency: 32 # ← Limit to 32 concurrent requests
+uv run inference-endpoint benchmark from-config \
+  --config examples/09_MultiTurn/kimi_agentic_benchmark.yaml
 ```
-
-**Behavior**:
-
-- With `target_concurrency`: At most `target_concurrency` conversations are active simultaneously; each active conversation has exactly one in-flight turn at any time.
-- Turn sequencing is preserved: turn N+1 is issued only after turn N's response arrives.
-
-**Use cases**:
-
-- **Prevent endpoint overload**: Control request rate to busy endpoints
-- **Large-scale testing**: Benchmark 1000+ conversations without overwhelming system
-- **Resource management**: Stay within port limits, memory constraints
-
-**Example**: 100 conversations with `target_concurrency: 32`
-
-```
-t=0:   Start 32 conversations, issue turn-1 for each (32 in-flight)
-t=0.5: Turn-1 of conv A completes → issue turn-2 of conv A (still 32 in-flight)
-t=1.0: All turns of conv B complete → start conv 33, issue its turn-1 (still 32 in-flight)
-...    Maintains at most 32 active conversations
-```
-
-### Turn Timeout
-
-Configure the maximum time allowed between issuing a turn and receiving its response:
-
-```yaml
-multi_turn:
-  turn_timeout_s: 300.0 # 5 minutes
-```
-
-If a turn does not receive a response within `turn_timeout_s` seconds, that turn is marked failed and all remaining turns in the same conversation are aborted (subsequent turns depend on the timed-out response). The event is logged as a warning.
-
-## Running Multi-Turn Benchmarks
-
-### Using Configuration File
-
-```bash
-inference-endpoint benchmark from-config \
-  --config examples/09_MultiTurn/multi_turn_benchmark.yaml
-```
-
-### Viewing Results
-
-Multi-turn benchmarks produce per-turn metrics:
-
-- **Per-turn metrics**: Latency, TTFT, TPOT for each individual turn
-- **Per-conversation metrics**: Total conversation latency, conversations per second _(planned — not yet implemented)_
-
-**Note**: Multi-turn datasets are only supported as performance datasets. Using a multi-turn dataset as an accuracy dataset (`type: accuracy`) is not yet supported and will raise an error at startup.
-
-Results are stored in the configured `report_dir`. Each record in
-`events.jsonl` carries `conversation_id` and `turn` alongside `sample_uuid`,
-so conversation-level filtering requires no join. `sample_idx_map.json` maps
-`sample_uuid → dataset sample index` for callers that need it.
-
-## Example Datasets
-
-### customer_support_conversations.jsonl
-
-Simple customer support conversations demonstrating basic multi-turn interactions:
-
-- 3 conversations
-- 2-4 turns per conversation
-- Customer support agent system prompt
-
-## Architecture Notes
-
-### Key Components
-
-- **ConversationManager**: Tracks conversation state and message history
-- **MultiTurnStrategy**: Enforces turn sequencing within conversations
-- **MultiTurnDataset**: Validates and structures multi-turn data
-
-### Turn Sequencing
-
-The system ensures that:
-
-1. Turn N+1 cannot be issued until turn N completes
-2. Message history is included in subsequent requests
-3. Concurrent conversations are supported (in independent mode)
-
-### Memory Considerations
-
-Each conversation maintains message history in memory. For large-scale benchmarks with long conversations:
-
-- Memory usage: ~1KB per turn (approximate)
-- 1000 conversations × 10 turns = ~10MB
-
-## Troubleshooting
-
-### "Conversation has invalid role sequence"
-
-**Cause**: Conversation doesn't follow a valid role sequence.
-
-**Fix**: For plain chat, ensure the dataset alternates between user and assistant:
-
-```
-user -> assistant -> user -> assistant -> ...
-```
-
-For agentic datasets, use the conversion script (`scripts/convert_agentic_snapshot.py`) to
-produce a properly sequenced flat-row file. The valid agentic sequence is:
-
-```
-user -> assistant (tool_calls) -> tool -> [assistant (tool_calls) -> tool]* -> assistant -> user -> ...
-```
-
-**Note**: Parallel tool results from a single dispatch must be **merged into
-one row** with a `tool_results` list, not represented as multiple consecutive
-`tool` rows. The validator rejects consecutive `tool` rows.
-
-### "Turn timed out"
-
-**Cause**: A turn did not receive a response within `turn_timeout_s` seconds after it was issued.
-
-**Fixes**:
-
-- Increase `turn_timeout_s` in configuration
-- Check endpoint performance
-- Verify endpoint is responding
-
-### Single-turn benchmarks unaffected
-
-Multi-turn logic is only activated when a `multi_turn:` block is present in the dataset configuration. Existing single-turn benchmarks continue to work unchanged with zero performance overhead.
-
-## Future Enhancements
-
-Planned features:
-
-- [ ] Poisson conversation arrival mode
-- [ ] Per-conversation metrics in reporting (total conversation latency, conversations per second)
-- [ ] Conversation-level latency percentiles
-- [ ] Dynamic conversation branching
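
The Dataset section of the new README states three rules for the flat JSONL file: every row carries `conversation_id`, `turn`, and `role`; rows for one `conversation_id` are contiguous; and `turn` increases within a conversation. A minimal pre-flight check for those rules might look like the sketch below. It is not the benchmark's own loader, and the names `validate_rows` and `REQUIRED_FIELDS` are hypothetical, chosen only for illustration.

```python
import json
from typing import Iterable

# Fields the README lists as required on every row.
REQUIRED_FIELDS = ("conversation_id", "turn", "role")


def validate_rows(lines: Iterable[str]) -> list[str]:
    """Return a list of problems found in flat multi-turn JSONL rows."""
    problems: list[str] = []
    finished: set[str] = set()  # conversations whose block has already ended
    current_id = None           # conversation_id of the block being read
    last_turn = None            # last turn number seen in the current block

    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [f for f in REQUIRED_FIELDS if f not in row]
        if missing:
            problems.append(f"line {lineno}: missing {', '.join(missing)}")
            continue
        cid, turn = row["conversation_id"], row["turn"]
        if cid != current_id:
            # A new block starts; seeing an already-closed id means interleaving.
            if cid in finished:
                problems.append(f"line {lineno}: rows for {cid!r} are not contiguous")
            if current_id is not None:
                finished.add(current_id)
            current_id, last_turn = cid, None
        if last_turn is not None and turn <= last_turn:
            problems.append(f"line {lineno}: turn {turn} does not increase in {cid!r}")
        last_turn = turn
    return problems


if __name__ == "__main__":
    sample = [
        '{"conversation_id":"c1","turn":1,"role":"user","content":"hi"}',
        '{"conversation_id":"c2","turn":1,"role":"user","content":"hello"}',
        '{"conversation_id":"c1","turn":2,"role":"assistant","content":"ok"}',
    ]
    for problem in validate_rows(sample):
        print(problem)
```

Passing an open file handle (for example `validate_rows(open(path))`) checks a dataset row by row without loading it into memory. The sketch does not check role sequencing or the shape of `tool_calls` and `tool_results`.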