
Port merge_embeddings benchmark to tritonbench#5650

Open
q10 wants to merge 1 commit into pytorch:main from q10:export-D101068554


q10 commented Apr 16, 2026

Summary:
Add Kineto trace export support to the merge_embeddings benchmark for A/B comparison with the tritonbench port.

This diff extends merge_embeddings_benchmark.py with two new CLI options that enable exporting Kineto Chrome traces. This is needed for kernel-level A/B comparisons between the existing FBGEMM benchmark and the new tritonbench port of the same operator.

  1. --export-trace flag: When enabled, switches the profiler from the existing ProfilerActivity.CUDA + table-print path to an on_trace_ready callback that exports a Chrome trace file.
  2. --trace-url template: Configurable output path with {ospid} placeholder for the PID, defaulting to merge_embeddings_fwd_trace_{ospid}.json.
  3. _kineto_trace_handler() inner function: Formats the trace URL with the current PID and calls p.export_chrome_trace(url).
  4. Plumbing: Both new options are threaded from the cli() Click entrypoint through to benchmark_merge_pooled_embeddings().

The existing (non-trace) profiling behavior is preserved as the default path when --export-trace is not passed.
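
The trace handler described in item 3 could look roughly like the sketch below. This is not the code from the diff: the factory `make_kineto_trace_handler` is a hypothetical wrapper added here so the example stands alone (the summary describes an inner function instead), and `p` is assumed to be any object exposing `export_chrome_trace(path)`, as `torch.profiler.profile` does.

```python
import os


def make_kineto_trace_handler(trace_url: str):
    """Build a callback for a profiler's on_trace_ready hook (hypothetical helper)."""

    def _kineto_trace_handler(p) -> None:
        # Fill the {ospid} placeholder with the current process ID.
        url = trace_url.format(ospid=os.getpid())
        # Assumption: `p` provides export_chrome_trace(path), like torch.profiler.profile.
        p.export_chrome_trace(url)

    return _kineto_trace_handler
```

Under these assumptions, the trace-export branch would pass the handler as `on_trace_ready=make_kineto_trace_handler(trace_url)` when constructing `torch.profiler.profile`, so each process writes its own file, for example `merge_embeddings_fwd_trace_<pid>.json` with the default template.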

Detailed Changes

merge_embeddings_benchmark.py

  • Added import os and from contextlib import nullcontext (note: nullcontext is unused and flagged by lint)
  • Added export_trace and trace_url parameters to benchmark_merge_pooled_embeddings()
  • Refactored profiling block into two branches: trace-export path (on_trace_ready) vs. existing CUDA activity profiling path
  • Added --export-trace and --trace-url Click options to cli()
  • Plumbed new options through both the sweep loop and single-run call paths
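
To illustrate the plumbing of the two options, here is a minimal sketch. It deliberately uses the standard library's `argparse` as a stand-in for the Click entrypoint named above, and `benchmark_stub` is a hypothetical placeholder for `benchmark_merge_pooled_embeddings()` that only reports which profiling branch would run. Only the option names and the default template are taken from the summary.

```python
import argparse

DEFAULT_TRACE_URL = "merge_embeddings_fwd_trace_{ospid}.json"


def build_parser() -> argparse.ArgumentParser:
    # argparse stand-in for the Click options described in the summary.
    parser = argparse.ArgumentParser()
    parser.add_argument("--export-trace", action="store_true", default=False)
    parser.add_argument("--trace-url", type=str, default=DEFAULT_TRACE_URL)
    return parser


def benchmark_stub(export_trace: bool, trace_url: str) -> str:
    # Hypothetical placeholder: the real benchmark would run the profiler here.
    return "trace-export" if export_trace else "table-print"


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    # Both options are passed through explicitly, mirroring the described plumbing.
    return benchmark_stub(export_trace=args.export_trace, trace_url=args.trace_url)
```

With no flags, `main([])` takes the existing table-print path; passing `--export-trace` selects the trace-export path.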

Reviewed By: henrylhtsang

Differential Revision: D101068554

meta-cla Bot added the "cla signed" label Apr 16, 2026
meta-codesync Bot commented Apr 16, 2026

@q10 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101068554.
