Merged
Changes from 19 commits
20 commits
07b7784
fix(openai): SGLang/vLLM wire-format compatibility
hvagadia May 6, 2026
e98dcf5
feat(multi-turn): cache-bursting salt and faithful prior-turn replay
hvagadia May 6, 2026
df40a4d
chore(09_MultiTurn): retire placeholder configs and tune agentic YAMLs
hvagadia May 6, 2026
8ca0c6e
feat(09_MultiTurn/scripts): add inline-accuracy scorer and dataset an…
hvagadia May 6, 2026
889fc7c
feat(multi_turn): cache-bursting salt + dataset_metadata header toler…
hvagadia May 6, 2026
2c859fb
docs(09_MultiTurn): add accuracy-eval step to README + lint scorer
hvagadia May 11, 2026
89573e5
feat: plumb chat_template_kwargs + intent_codes ground truth
hvagadia May 14, 2026
8dfa1e0
docs: update multi-turn dataset paths
hvagadia May 18, 2026
3d395f8
fix(multi-turn): validate null conversation ids
hvagadia May 18, 2026
eb59b3e
chore(09_MultiTurn): keep local scripts out of PR
hvagadia May 18, 2026
55c04fc
docs(09_MultiTurn): consolidate Kimi benchmark config
hvagadia May 18, 2026
ba4ba32
chore: minimize source diffs
hvagadia May 18, 2026
2c047b5
chore: minimize multi-turn replay diffs
hvagadia May 18, 2026
fbffe8a
feat(multiturn): add inline accuracy workflow
hvagadia May 18, 2026
d9ec2bb
fix: address multi-turn review feedback
hvagadia May 25, 2026
9ebc5a8
chore: remove accuracy helper from pr
hvagadia May 25, 2026
1781aa1
chore: fix import ordering after rebase
hvagadia May 25, 2026
d33250b
docs: trim multi-turn readme accuracy references
hvagadia May 25, 2026
9c1888b
chore: fix pre-commit formatting
hvagadia May 25, 2026
4f76238
fix: address multi-turn review comments
hvagadia Jun 2, 2026
356 changes: 76 additions & 280 deletions examples/09_MultiTurn/README.md
@@ -1,297 +1,93 @@
# Multi-Turn Conversation Benchmarking Examples
# Multi-Turn Agentic Benchmark

This directory contains examples for benchmarking conversational AI workloads with multi-turn conversation support.
This example runs multi-turn agentic conversations through an OpenAI-compatible
endpoint. The client preserves conversation order, sends one in-flight turn per
active conversation, and adds `X-Session-ID: <conversation_id>` on every request
so a router can keep a conversation on the same backend.
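A minimal sketch of what one turn's request could look like on the wire, assuming a plain JSON body with `model` and `messages`. The helper name `build_turn_request` and the example URL are hypothetical; only the `X-Session-ID` header and the `/v1/chat/completions` path come from this README. The request is built but not sent.

```python
import json
import urllib.request


def build_turn_request(base_url, model, messages, conversation_id):
    """Build (but do not send) a chat-completions request for one turn."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Same value on every turn so a router can pin the conversation.
            "X-Session-ID": conversation_id,
        },
        method="POST",
    )


request = build_turn_request(
    "http://localhost:8000",  # assumed server address
    "kimi-k2.6",
    [{"role": "user", "content": "hello"}],
    "c1",
)
```

Note that `urllib` normalizes header capitalization internally; HTTP header names are case-insensitive, so a router matching on `X-Session-ID` is unaffected.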

## Overview
## Dataset

Multi-turn conversation benchmarking enables testing realistic conversational AI scenarios where each turn depends on previous responses. The system maintains conversation history and enforces turn sequencing to simulate real-world multi-turn interactions.

## Dataset Format

Multi-turn datasets use JSONL format with the following structure:
Use flat JSONL with one row per message. Rows for each `conversation_id` must be
contiguous and ordered by increasing `turn`.

```jsonl
{"conversation_id": "c1", "turn": 1, "role": "user", "content": "...", "system": "..."}
{"conversation_id": "c1", "turn": 2, "role": "assistant", "content": "..."}
{"conversation_id": "c1", "turn": 3, "role": "user", "content": "..."}
{"conversation_id":"c1","turn":1,"role":"user","system":"...","content":"...","tools":[...],"delay_seconds":0.4}
{"conversation_id":"c1","turn":2,"role":"assistant","tool_calls":[...]}
{"conversation_id":"c1","turn":3,"role":"tool","tool_results":[...],"delay_seconds":1.2}
{"conversation_id":"c1","turn":4,"role":"assistant","content":"..."}
```
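Before a run, the contiguity and ordering rules can be checked with a short script. This is a sketch under stated assumptions, not the loader's actual validator: `validate_flat_rows` is a hypothetical helper, it assumes `turn` is an integer, and it only checks the three required fields plus row order.

```python
import json
from typing import Iterable


def validate_flat_rows(lines: Iterable[str]) -> list[str]:
    """Return a list of problems found in flat multi-turn JSONL rows."""
    problems = []
    finished = set()  # conversation ids whose block has already ended
    current = None    # conversation id of the block being read
    last_turn = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        missing = [k for k in ("conversation_id", "turn", "role") if row.get(k) is None]
        if missing:
            problems.append(f"line {lineno}: missing {missing}")
            continue
        cid, turn = row["conversation_id"], row["turn"]
        if cid != current:
            # A new block starts; the id must not have appeared in an earlier block.
            if cid in finished:
                problems.append(f"line {lineno}: rows for {cid!r} are not contiguous")
            if current is not None:
                finished.add(current)
            current, last_turn = cid, 0
        if turn <= last_turn:
            problems.append(f"line {lineno}: turn {turn} does not increase")
        last_turn = turn
    return problems
```

Typical use is `validate_flat_rows(open(path, encoding="utf-8"))`; an empty list means the file passed these checks.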

### Required Fields

- `conversation_id`: Unique identifier for each conversation
- `turn`: Turn number within conversation (1-indexed)
- `role`: Speaker role ("user" or "assistant")
- `content`: Message content

### Optional Fields

- `system`: System prompt (typically only on first user turn)
- `model`: Model name override for this turn
- `max_new_tokens`: Maximum tokens to generate for this turn

### Validation Rules

1. All rows for a given `conversation_id` must appear **consecutively** in the file (no interleaving
   with rows from other conversations). File order within a conversation does not matter: the
   loader sorts by the `turn` column when building conversation history.
   The flat-row format is intentional: it enables row-by-row streaming without loading entire
   conversations into memory first.
2. Conversations must follow a valid role sequence:
   - Plain chat: `user → assistant → user → ...`
   - Agentic: `user → assistant (with tool_calls) → tool (tool_results list; parallel results merged) → [assistant (with tool_calls) → tool]* → assistant → user → ...`
3. First turn must be "user" role
4. Turn numbers must be sequential (1, 2, 3, ...)
5. Each conversation must have at least one turn

## Agentic (Tool-Sequence) Datasets

For agentic workloads where the model dispatches tools, the dataset must include tool-call
metadata. The source format for these datasets is a **snapshot JSONL** — each line contains the
full conversation history at a particular checkpoint. The benchmarker requires **flat-row JSONL**
(one row per message), so a conversion step is needed first.

### Source snapshot format

Each line in the source file represents one snapshot of a conversation:

```json
{
  "conversation_id": "sim_001",
  "conversation_idx": 5,
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "tool_calls": [{"id": "...", "type": "function", "function": {"name": "bash", "arguments": "{\"cmd\": \"ls\"}"}}]},
    {"role": "tool", "tool_call_id": "...", "content": "file1.txt\nfile2.txt"},
    {"role": "assistant", "content": "Done."}
  ],
  "tools": [...],
  "metadata": {}
}
```
Required fields are `conversation_id`, `turn`, and `role`. User rows normally
include `content`; agentic rows can also include `system`, `tools`,
`tool_calls`, `tool_results`, `reasoning_content`, and `delay_seconds`.

Multiple snapshots may exist per `conversation_id` (one per `conversation_idx`); only the
highest-indexed snapshot per conversation is used.
Place the dataset under `examples/09_MultiTurn/datasets/` or point the YAML at
another accessible JSONL path.

### Converting to flat-row format
## Start A Server

The following commands convert each source snapshot file to the flat-row format required by the benchmarker.
Run from the repo root:
Start an SGLang OpenAI-compatible server. This is the standard recipe used for
throughput replays; adjust `--model-path`, `--tp`, and `--port` for your node.

```bash
# First argument: input snapshot JSONL; second argument: output flat-row JSONL
python scripts/convert_agentic_snapshot.py \
/path/to/agentic_coding_dataset.jsonl \
examples/09_MultiTurn/datasets/agentic_coding_flat.jsonl \
--verify

python scripts/convert_agentic_snapshot.py \
/path/to/agentic_workflow_dataset.jsonl \
examples/09_MultiTurn/datasets/agentic_workflow_flat.jsonl \
--verify
```

The `datasets/` directory under `examples/09_MultiTurn/` is a placeholder; run the conversion
commands above to populate it before benchmarking.

The `--verify` flag cross-checks every client turn's message history against the source snapshot
and exits with code 1 if any mismatch is found. The script also:

- Collapses consecutive `user` messages into one (keeps turn sequencing clean)
- Merges consecutive `tool` messages for the same assistant dispatch into a single row with a
`tool_results` list (so all parallel results are sent together in one API call)

### Flat-row format after conversion

The extra fields supported beyond plain user/assistant:

| Row role | Extra fields |
| ------------------------------------------ | ------------------------------------------------------------------ |
| `assistant` with tool calls | `tool_calls: [{id, type, function: {name, arguments}}]` |
| `tool` results (single or merged parallel) | `tool_results: [{tool_call_id, content}, ...]` |
| `user` or `tool` turns | `tools: [...]` (OpenAI tool definitions forwarded to the endpoint) |

Example rows from a converted agentic dataset:

```jsonl
{"conversation_id": "sim_001", "turn": 1, "role": "user", "content": "Fix the bug in foo.py", "system": "You are a coding agent.", "tools": [...]}
{"conversation_id": "sim_001", "turn": 2, "role": "assistant", "tool_calls": [{"id": "functions.bash:0", "type": "function", "function": {"name": "bash", "arguments": "{\"cmd\": \"cat foo.py\"}"}}]}
{"conversation_id": "sim_001", "turn": 3, "role": "tool", "tool_results": [{"tool_call_id": "functions.bash:0", "content": "def foo():\n return 1/0"}], "tools": [...]}
{"conversation_id": "sim_001", "turn": 4, "role": "assistant", "content": "The bug is a ZeroDivisionError. Here is the fix: ..."}
python3 -m sglang.launch_server \
--model-path /path/to/Kimi-K2.6 \
--served-model-name kimi-k2.6 \
--tp 8 \
--trust-remote-code \
--reasoning-parser kimi_k2 \
--tool-call-parser kimi_k2 \
--host 0.0.0.0 \
--port 8000
```

### Running agentic benchmarks

After converting the datasets, update the `path` field in the config files and run:
`--model-path` is the checkpoint loaded by the server. It can be a local path
visible to the server container or a Hugging Face model id, depending on your
SGLang environment. `--served-model-name` is the OpenAI model name exposed to
clients; set `model_params.name` in the YAML to the same value.

## Client YAML

The runnable config is
`examples/09_MultiTurn/kimi_agentic_benchmark.yaml`.

Key fields:

- `type: online`: runs the benchmark through the online scheduler.
- `model_params.name`: model name sent in each OpenAI request. Keep it aligned
with the served model name.
- `model_params.temperature`, `top_p`, `max_new_tokens`: sampling settings sent
to the server. `max_new_tokens` is large because agent turns can be long.
- `model_params.chat_template_kwargs`: Kimi-specific template options for
reasoning preservation.
- First `datasets` entry `name`: label used in benchmark outputs.
- First `datasets` entry `type: performance`: multi-turn datasets are replayed as
performance datasets.
- First `datasets` entry `path`: JSONL dataset path to run.
- First `datasets` entry `multi_turn.turn_timeout_s`: per-turn deadline. A
timeout aborts the remaining turns in that conversation.
- First `datasets` entry `multi_turn.enable_salt`: appends a deterministic cache
salt to each conversation system prompt.
- First `datasets` entry `multi_turn.inject_tool_delay`: honors positive
`delay_seconds` values from client turns before issuing those turns.
- `settings.runtime.min_duration_ms`: minimum run duration. With no max duration
override, the run finishes when the dataset is exhausted.
- `settings.load_pattern.type: multi_turn`: enables conversation-aware issuing.
- `settings.load_pattern.target_concurrency`: maximum active conversations.
Each active conversation has at most one in-flight request.
- `settings.client.warmup_connections: 0`: avoids stale pre-warmed sockets with
servers that close idle connections quickly.
- `settings.client.max_idle_time`: connection idle lifetime.
- `endpoint_config.endpoints`: server URL list.
- `endpoint_config.api_type: openai`: use `/v1/chat/completions`.
- `report_dir`: output directory for events, snapshots, and reports.

## Run The Client

Update the first `datasets` entry (`name` and `path`), `model_params.name`, and
`endpoint_config.endpoints` as needed, then run:

```bash
inference-endpoint benchmark from-config \
--config examples/09_MultiTurn/agentic_coding_benchmark.yaml

inference-endpoint benchmark from-config \
--config examples/09_MultiTurn/agentic_workflow_benchmark.yaml
```

---

## Configuration

### Basic Configuration

```yaml
datasets:
  - name: customer_support
    type: performance
    path: examples/09_MultiTurn/customer_support_conversations.jsonl
    multi_turn:
      turn_timeout_s: 300.0

settings:
  load_pattern:
    type: multi_turn
    target_concurrency: 32 # ← Required for multi_turn load pattern
```

### Concurrency Control

The `target_concurrency` field is **required** for the `multi_turn` load pattern and controls the maximum number of conversations active simultaneously (each active conversation has at most one in-flight turn):

```yaml
settings:
  load_pattern:
    type: multi_turn
    target_concurrency: 32 # ← Limit to 32 active conversations
uv run inference-endpoint benchmark from-config \
--config examples/09_MultiTurn/kimi_agentic_benchmark.yaml
```

**Behavior**:

- With `target_concurrency`: At most `target_concurrency` conversations are active simultaneously; each active conversation has at most one in-flight turn at any time.
- Turn sequencing is preserved: turn N+1 is issued only after turn N's response arrives.

**Use cases**:

- **Prevent endpoint overload**: Control request rate to busy endpoints
- **Large-scale testing**: Benchmark 1000+ conversations without overwhelming the system
- **Resource management**: Stay within port limits and memory constraints

**Example**: 100 conversations with `target_concurrency: 32`

```
t=0: Start 32 conversations, issue turn-1 for each (32 in-flight)
t=0.5: Turn-1 of conv A completes → issue turn-2 of conv A (still 32 in-flight)
t=1.0: All turns of conv B complete → start conv 33, issue its turn-1 (still 32 in-flight)
... Maintains at most 32 active conversations
```
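The timeline above can be sketched with an `asyncio.Semaphore`. This is an illustration of the scheduling idea only, not the benchmark's scheduler; `run_conversation`, `run_all`, and the `send_turn` callback are hypothetical names.

```python
import asyncio


async def run_conversation(turns, send_turn):
    """Issue turns one at a time; turn N+1 waits for turn N's response."""
    history = []
    for turn in turns:
        reply = await send_turn(history, turn)
        history += [turn, reply]
    return history


async def run_all(conversations, send_turn, target_concurrency):
    """Keep at most target_concurrency conversations active at once."""
    slots = asyncio.Semaphore(target_concurrency)

    async def guarded(turns):
        # A slot is held for the whole conversation, so a new conversation
        # starts only when an earlier one has finished all of its turns.
        async with slots:
            return await run_conversation(turns, send_turn)

    return await asyncio.gather(*(guarded(turns) for turns in conversations))
```

Because each conversation awaits one turn at a time, the number of in-flight requests never exceeds the number of held slots.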

### Turn Timeout

Configure the maximum time allowed between issuing a turn and receiving its response:

```yaml
multi_turn:
  turn_timeout_s: 300.0 # 5 minutes
```

If a turn does not receive a response within `turn_timeout_s` seconds, that turn is marked failed and all remaining turns in the same conversation are aborted (subsequent turns depend on the timed-out response). The event is logged as a warning.
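The abort-on-timeout behavior can be sketched with `asyncio.wait_for`. This is a simplified illustration, not the benchmark's implementation; `run_conversation_with_timeout` and the `send_turn` callback are hypothetical names.

```python
import asyncio


async def run_conversation_with_timeout(turns, send_turn, turn_timeout_s):
    """Return (responses, aborted_count) for one conversation.

    aborted_count is the number of turns skipped after a timed-out turn.
    """
    responses = []
    for index, turn in enumerate(turns):
        try:
            reply = await asyncio.wait_for(send_turn(turn), timeout=turn_timeout_s)
        except asyncio.TimeoutError:
            # Later turns depend on the missing response, so stop here.
            return responses, len(turns) - index - 1
        responses.append(reply)
    return responses, 0
```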

## Running Multi-Turn Benchmarks

### Using Configuration File

```bash
inference-endpoint benchmark from-config \
--config examples/09_MultiTurn/multi_turn_benchmark.yaml
```

### Viewing Results

Multi-turn benchmarks produce per-turn metrics:

- **Per-turn metrics**: Latency, TTFT, TPOT for each individual turn
- **Per-conversation metrics**: Total conversation latency, conversations per second _(planned — not yet implemented)_

**Note**: Multi-turn datasets are only supported as performance datasets. Using a multi-turn dataset as an accuracy dataset (`type: accuracy`) is not yet supported and will raise an error at startup.

Results are stored in the configured `report_dir`. Each record in
`events.jsonl` carries `conversation_id` and `turn` alongside `sample_uuid`,
so conversation-level filtering requires no join. `sample_idx_map.json` maps
`sample_uuid → dataset sample index` for callers that need it.
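As a sketch of conversation-level filtering, the records can be grouped directly on `conversation_id`. The helper `turns_by_conversation` is hypothetical, and the only record fields assumed are the ones named above; records lacking them are skipped.

```python
import json
from collections import defaultdict


def turns_by_conversation(event_lines):
    """Group event records by conversation_id, each group ordered by turn."""
    grouped = defaultdict(list)
    for line in event_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Skip any record that does not carry both fields.
        if "conversation_id" in record and "turn" in record:
            grouped[record["conversation_id"]].append(record)
    return {
        cid: sorted(records, key=lambda r: r["turn"])
        for cid, records in grouped.items()
    }
```

Typical use is `turns_by_conversation(open("<report_dir>/events.jsonl", encoding="utf-8"))`, with `<report_dir>` replaced by the configured directory.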

## Example Datasets

### customer_support_conversations.jsonl

Simple customer support conversations demonstrating basic multi-turn interactions:

- 3 conversations
- 2-4 turns per conversation
- Customer support agent system prompt

## Architecture Notes

### Key Components

- **ConversationManager**: Tracks conversation state and message history
- **MultiTurnStrategy**: Enforces turn sequencing within conversations
- **MultiTurnDataset**: Validates and structures multi-turn data

### Turn Sequencing

The system ensures that:

1. Turn N+1 cannot be issued until turn N completes
2. Message history is included in subsequent requests
3. Concurrent conversations are supported (in independent mode)

### Memory Considerations

Each conversation maintains message history in memory. For large-scale benchmarks with long conversations:

- Memory usage: ~1KB per turn (approximate)
- 1000 conversations × 10 turns = ~10MB

## Troubleshooting

### "Conversation has invalid role sequence"

**Cause**: Conversation doesn't follow a valid role sequence.

**Fix**: For plain chat, ensure the dataset alternates between user and assistant:

```
user -> assistant -> user -> assistant -> ...
```

For agentic datasets, use the conversion script (`scripts/convert_agentic_snapshot.py`) to
produce a properly sequenced flat-row file. The valid agentic sequence is:

```
user -> assistant (tool_calls) -> tool -> [assistant (tool_calls) -> tool]* -> assistant -> user -> ...
```

**Note**: Parallel tool results from a single dispatch must be **merged into
one row** with a `tool_results` list, not represented as multiple consecutive
`tool` rows. The validator rejects consecutive `tool` rows.

### "Turn timed out"

**Cause**: A turn did not receive a response within `turn_timeout_s` seconds after it was issued.

**Fixes**:

- Increase `turn_timeout_s` in configuration
- Check endpoint performance
- Verify endpoint is responding

### Single-turn benchmarks unaffected

Multi-turn logic is only activated when a `multi_turn:` block is present in the dataset configuration. Existing single-turn benchmarks continue to work unchanged with zero performance overhead.

## Future Enhancements

Planned features:

- [ ] Poisson conversation arrival mode
- [ ] Per-conversation metrics in reporting (total conversation latency, conversations per second)
- [ ] Conversation-level latency percentiles
- [ ] Dynamic conversation branching