Fix cross-stream memory visibility race on DGX spark Blackwell (sm_121) by willdzeng · Pull Request #41 · nvidia-isaac/cuVSLAM

willdzeng · 2026-04-10T05:36:41Z

Summary

Fix CUDA error 700 (illegal memory access) on Blackwell GPUs (sm_121) in multi-camera visual odometry. The crash is deterministic, occurring after roughly 200 frames with 6-camera (3 stereo pairs) configurations.

Root Cause

MonoSOFGPU::track() builds image pyramids and gradient pyramids on per-camera CUDA streams. These pyramids are later read by stereo pair tracking (LaunchTrackingPrimaryToSecondary) on different per-pair streams, without explicit cross-stream synchronization.

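On pre-Blackwell GPUs, the implicit memory ordering happened to work, even though CUDA does not guarantee ordering between operations on different streams. On Blackwell (sm_121), the stricter memory model exposes the missing synchronization, so cross-stream memory visibility has to be established explicitly.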

Fix

- `sof_mono_gpu.cpp`: Add `cudaStreamSynchronize` after the pyramid + gradient build to ensure GPU memory is visible to other streams
- `sof_multicamera_base.cpp`: Add `cudaDeviceSynchronize` barriers before the mono finish and stereo launch phases
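
The per-stream part of this fix can be sketched in isolation. The example below is a minimal illustration, not the cuVSLAM code: `buildPyramid`, `trackStereo`, `monoStream`, `pairStream`, and `countMismatchesWithStreamSync` are hypothetical stand-ins for the pyramid build, the stereo tracking kernel, and the per-camera and per-pair streams. It assumes only the standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                               \
  do {                                                                 \
    cudaError_t err_ = (call);                                         \
    if (err_ != cudaSuccess) {                                         \
      std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",                 \
                   cudaGetErrorString(err_), __FILE__, __LINE__);      \
      std::exit(EXIT_FAILURE);                                         \
    }                                                                  \
  } while (0)

// Hypothetical stand-in for the pyramid + gradient build (producer).
__global__ void buildPyramid(float* pyramid, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) pyramid[i] = 2.0f * static_cast<float>(i);
}

// Hypothetical stand-in for stereo tracking (consumer on another stream).
__global__ void trackStereo(const float* pyramid, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = pyramid[i] + 1.0f;
}

// Returns the number of elements the consumer computed incorrectly.
int countMismatchesWithStreamSync() {
  const int n = 1 << 20;
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  float* pyramid = nullptr;
  float* out = nullptr;
  CUDA_CHECK(cudaMalloc(&pyramid, n * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&out, n * sizeof(float)));

  // Non-blocking streams: no implicit ordering with each other or stream 0.
  cudaStream_t monoStream, pairStream;
  CUDA_CHECK(cudaStreamCreateWithFlags(&monoStream, cudaStreamNonBlocking));
  CUDA_CHECK(cudaStreamCreateWithFlags(&pairStream, cudaStreamNonBlocking));

  buildPyramid<<<blocks, threads, 0, monoStream>>>(pyramid, n);
  CUDA_CHECK(cudaGetLastError());

  // Conservative fix: block the host until the producer stream has drained,
  // so work launched afterwards is ordered after the pyramid writes.
  CUDA_CHECK(cudaStreamSynchronize(monoStream));

  trackStereo<<<blocks, threads, 0, pairStream>>>(pyramid, out, n);
  CUDA_CHECK(cudaGetLastError());

  std::vector<float> host(n);
  CUDA_CHECK(cudaMemcpyAsync(host.data(), out, n * sizeof(float),
                             cudaMemcpyDeviceToHost, pairStream));
  CUDA_CHECK(cudaStreamSynchronize(pairStream));

  int mismatches = 0;
  for (int i = 0; i < n; ++i) {
    if (host[i] != 2.0f * static_cast<float>(i) + 1.0f) ++mismatches;
  }

  CUDA_CHECK(cudaStreamDestroy(monoStream));
  CUDA_CHECK(cudaStreamDestroy(pairStream));
  CUDA_CHECK(cudaFree(pyramid));
  CUDA_CHECK(cudaFree(out));
  return mismatches;
}

int main() {
  int mismatches = countMismatchesWithStreamSync();
  std::printf("mismatches: %d\n", mismatches);
  return mismatches == 0 ? 0 : 1;
}
```

Without the `cudaStreamSynchronize(monoStream)` call, nothing orders the consumer kernel after the producer kernel, so the read of `pyramid` would be a race. The cost of this approach is that the host thread stalls once per mono stream.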

Testing

Tested on NVIDIA GB10 (aarch64, CUDA 13, sm_121) with a 6-camera rig (3 stereo pairs), 8075 frames:

| Variant | Result |
| --- | --- |
| No fix | Crash at frame ~209 |
| `cudaDeviceSynchronize` before stereo launch only | Crash at frame ~1053 |
| + `cudaDeviceSynchronize` before mono finish | Crash at frame ~2407 |
| + `cudaStreamSynchronize` after mono pyramid build | All 8075 frames pass |
| `CUDA_LAUNCH_BLOCKING=1` (reference) | All frames pass (confirms async race) |
| `compute-sanitizer` (reference) | All frames pass (serialized execution) |

Notes

This fix is conservative: it calls cudaStreamSynchronize on each mono stream and cudaDeviceSynchronize at phase boundaries. A more targeted approach using CUDA events (cudaEventRecord + cudaStreamWaitEvent) could preserve more parallelism, but it requires deeper changes to the Stream/ImageContext ownership model.
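
For reference, the event-based alternative could look roughly like the sketch below. This PR does not implement it. `buildPyramid`, `trackStereo`, `pyramidReady`, and `countMismatchesWithEvent` are hypothetical names, and the sketch ignores the Stream/ImageContext ownership questions mentioned above. It assumes only the standard CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                               \
  do {                                                                 \
    cudaError_t err_ = (call);                                         \
    if (err_ != cudaSuccess) {                                         \
      std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",                 \
                   cudaGetErrorString(err_), __FILE__, __LINE__);      \
      std::exit(EXIT_FAILURE);                                         \
    }                                                                  \
  } while (0)

// Hypothetical stand-in for the pyramid + gradient build (producer).
__global__ void buildPyramid(float* pyramid, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) pyramid[i] = 2.0f * static_cast<float>(i);
}

// Hypothetical stand-in for stereo tracking (consumer on another stream).
__global__ void trackStereo(const float* pyramid, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = pyramid[i] + 1.0f;
}

// Returns the number of elements the consumer computed incorrectly.
int countMismatchesWithEvent() {
  const int n = 1 << 20;
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  float* pyramid = nullptr;
  float* out = nullptr;
  CUDA_CHECK(cudaMalloc(&pyramid, n * sizeof(float)));
  CUDA_CHECK(cudaMalloc(&out, n * sizeof(float)));

  cudaStream_t monoStream, pairStream;
  CUDA_CHECK(cudaStreamCreateWithFlags(&monoStream, cudaStreamNonBlocking));
  CUDA_CHECK(cudaStreamCreateWithFlags(&pairStream, cudaStreamNonBlocking));

  // Timing is not needed, so disable it to keep the event lightweight.
  cudaEvent_t pyramidReady;
  CUDA_CHECK(cudaEventCreateWithFlags(&pyramidReady, cudaEventDisableTiming));

  buildPyramid<<<blocks, threads, 0, monoStream>>>(pyramid, n);
  CUDA_CHECK(cudaGetLastError());

  // Mark the point in the producer stream where the pyramid is complete.
  CUDA_CHECK(cudaEventRecord(pyramidReady, monoStream));
  // Make the consumer stream wait on the GPU; the host thread does not block.
  CUDA_CHECK(cudaStreamWaitEvent(pairStream, pyramidReady, 0));

  trackStereo<<<blocks, threads, 0, pairStream>>>(pyramid, out, n);
  CUDA_CHECK(cudaGetLastError());

  std::vector<float> host(n);
  CUDA_CHECK(cudaMemcpyAsync(host.data(), out, n * sizeof(float),
                             cudaMemcpyDeviceToHost, pairStream));
  CUDA_CHECK(cudaStreamSynchronize(pairStream));

  int mismatches = 0;
  for (int i = 0; i < n; ++i) {
    if (host[i] != 2.0f * static_cast<float>(i) + 1.0f) ++mismatches;
  }

  CUDA_CHECK(cudaEventDestroy(pyramidReady));
  CUDA_CHECK(cudaStreamDestroy(monoStream));
  CUDA_CHECK(cudaStreamDestroy(pairStream));
  CUDA_CHECK(cudaFree(pyramid));
  CUDA_CHECK(cudaFree(out));
  return mismatches;
}

int main() {
  int mismatches = countMismatchesWithEvent();
  std::printf("mismatches: %d\n", mismatches);
  return mismatches == 0 ? 0 : 1;
}
```

The dependency here is expressed on the device: only `pairStream` waits for the event, and other streams and the host keep running. In a real code base, each consumer stream would need access to the event recorded by the producer that owns the buffer, which is presumably where the ownership changes come in.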

On Blackwell GPUs (sm_121, e.g. GB10), CUDA memory writes from one stream are not guaranteed visible to kernels on another stream without explicit synchronization. This causes CUDA error 700 (illegal memory access) in multi-camera visual odometry after ~200 frames.

Root cause: MonoSOFGPU::track() builds image pyramids and gradient pyramids on per-camera CUDA streams. These pyramids are later read by stereo pair tracking (LaunchTrackingPrimaryToSecondary) on different per-pair streams. On pre-Blackwell GPUs, the implicit memory ordering happened to work. On Blackwell, the stricter memory model exposes the missing synchronization.

Fix:

- Add cudaStreamSynchronize after pyramid+gradient build in MonoSOFGPU::track() to ensure data is visible to other streams
- Add cudaDeviceSynchronize barriers in MultiSOFBase::trackNextFrame() before mono finish and stereo launch phases

Tested on NVIDIA GB10 (aarch64, CUDA 13, sm_121) with 6-camera rig (3 stereo pairs, 8075 frames). Without fix: deterministic crash at frame ~209. With fix: completes all frames without error.

The fix is conservative: cudaStreamSynchronize per mono stream and cudaDeviceSynchronize at phase boundaries. A more targeted approach using cudaEvent + cudaStreamWaitEvent could preserve more parallelism but requires deeper changes to the Stream/ImageContext ownership model.

willdzeng · 2026-04-10T05:38:09Z

@aefitorov-nvidia @zwdoescode can you take a look and confirm whether this is the correct fix for DGX Spark (Blackwell)? I tested it on a DGX Spark and it seems to work.

willdzeng changed the title ~~Fix cross-stream memory visibility race on Blackwell (sm_121)~~ Fix cross-stream memory visibility race on DGX spark Blackwell (sm_121) Apr 10, 2026

willdzeng wants to merge 1 commit into nvidia-isaac:main from willdzeng:fix/blackwell-cross-stream-sync
