Add async background warmup to reduce first-kernel latency #721
Open
KaanKesginLW wants to merge 7 commits into JuliaGPU:main from KaanKesginLW:feature/async-warmup
Changes from 2 commits
Commits
2388be1 Add async background warmup to reduce first-kernel latency (KaanKesginLW)
4381232 Retrigger CI for benchmark failure investigation (KaanKesginLW)
1889292 Apply Runic formatting fixes (KaanKesginLW)
25061bd Address review feedback: remove export, skip warmup on single thread (KaanKesginLW)
2aae519 Fix warmup tests to handle single-threaded CI execution (KaanKesginLW)
24afa94 Simplify warmup tests to only test public API behavior (KaanKesginLW)
d4db4a1 Use Threads.@spawn instead of @async for warmup task (KaanKesginLW)
    @@ -76,6 +76,8 @@ export MetalBackend

    include("deprecated.jl")

    include("warmup.jl")

    include("precompile.jl")

    end # module
    @@ -0,0 +1,71 @@
    # Async warmup to reduce first-kernel JIT compilation latency
    #
    # The first GPU kernel in a Metal.jl session takes ~1.75s due to one-time JIT
    # compilation of GPUCompiler internals. By starting a minimal kernel compilation
    # in the background during __init__(), we can reduce this to 0.035-0.20s for the
    # user's first actual kernel, a 9-50x improvement.

    export warmup

Member
This should fix the benchmark error.
Suggested change: remove the `export warmup` line.

    # Minimal kernel that triggers the full compilation pipeline
    function _warmup_kernel!(a)
        i = thread_position_in_grid().x
        if i <= length(a)
            a[i] = 0.0f0
        end
        return nothing
    end

    # Called from __init__() via @async
    function _warmup_compilation()
        try
            # Minimal allocation - just need to trigger compilation
            arr = MtlArray{Float32}(undef, 1)
            # launch=false compiles but doesn't execute - fastest warmup path
            @metal launch=false _warmup_kernel!(arr)
            unsafe_free!(arr)
        catch
            # Silently ignore warmup failures - this is a non-critical optimization
        end
        return nothing
    end

    """
        warmup(; blocking::Bool=true)

    Ensure the GPU compilation pipeline is warmed up.

    The first GPU kernel in a Metal.jl session incurs a one-time JIT compilation overhead
    of ~1.7 seconds. Metal.jl automatically starts warming up in the background when the
    package is loaded. This function allows you to explicitly wait for warmup to complete.

    If `blocking=true` (default), waits for warmup to complete before returning.
    If `blocking=false`, returns immediately while warmup continues in background.

    # When to use

    Call `warmup()` before timing-sensitive code to ensure consistent benchmark results:

    ```julia
    using Metal
    Metal.warmup()  # wait for warmup to complete
    @time @metal kernel!(a)  # consistently fast (~0.035s, not ~1.7s)
    ```

    # Note

    You never need to call this function for correctness, only for consistent timing.
    Most users will never need to call this explicitly, as the background warmup will
    complete during normal program setup (loading data, preprocessing, etc.).
    """
    function warmup(; blocking::Bool=true)
        task = _warmup_task[]
        if task === nothing
            # Warmup wasn't started (non-functional GPU or disabled)
            return nothing
        end
        if blocking
            wait(task)
        end
        return nothing
    end
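
The diff above reads `_warmup_task[]` but does not show where that reference is defined or populated. The following is a minimal sketch of how such a pattern could look, assuming a `Ref{Union{Nothing, Task}}` that is filled at load time. All names here (`_task_ref`, `_expensive_setup`, `start_warmup!`, `wait_for_warmup`) are hypothetical stand-ins, not Metal.jl API, and `sleep` replaces the kernel compilation so the sketch runs without a GPU.

```julia
# Hypothetical stand-in for `_warmup_task`; the real definition is not in this diff.
const _task_ref = Ref{Union{Nothing, Task}}(nothing)

# Stand-in for the expensive one-time work (e.g. compiling a trivial kernel).
function _expensive_setup()
    try
        sleep(0.2)
    catch
        # Non-critical optimization: swallow failures, as the PR's warmup does.
    end
    return nothing
end

# Would be called once at load time, e.g. from a module's `__init__()`.
function start_warmup!()
    _task_ref[] = @async _expensive_setup()
    return nothing
end

# Mirrors the shape of `warmup(; blocking)` from the diff.
function wait_for_warmup(; blocking::Bool = true)
    task = _task_ref[]
    task === nothing && return nothing  # warmup was never started
    blocking && wait(task)
    return nothing
end

start_warmup!()
wait_for_warmup(blocking = false)  # returns immediately
wait_for_warmup()                  # blocks until the background task finishes
```

Storing the task in a `Ref` with a `nothing` default lets the public function distinguish "never started" from "started but not finished" without throwing.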
    @@ -0,0 +1,66 @@
    @testset "warmup" begin
        @testset "warmup task started" begin
            # Warmup should have been started during __init__
            @test Metal._warmup_task[] !== nothing
            @test Metal._warmup_enabled == true
        end

        @testset "warmup API" begin
            # Non-blocking call should return immediately
            @test Metal.warmup(blocking=false) === nothing

            # Blocking call should wait and return nothing
            @test Metal.warmup() === nothing
            @test Metal.warmup(blocking=true) === nothing
        end

        @testset "warmup task completion" begin
            # After calling warmup(), task should be done
            Metal.warmup()
            task = Metal._warmup_task[]
            @test istaskdone(task)
            @test !istaskfailed(task)
        end

        @testset "warmup accelerates compilation" begin
            # After warmup, kernel compilation should be fast
            Metal.warmup()

            function test_kernel!(a)
                i = thread_position_in_grid().x
                if i <= length(a)
                    a[i] = 1.0f0
                end
                return nothing
            end

            a = MtlArray{Float32}(undef, 256)
            t = @elapsed @metal launch=false test_kernel!(a)

            # After warmup, compilation should be under 0.5s
            # (without warmup it would be ~1.7s)
            @test t < 0.5
        end

        @testset "concurrent kernel compilation" begin
            # Verify that concurrent compilations don't deadlock
            Metal.warmup()

            function k1!(a)
                a[1] = 1.0f0
                return nothing
            end
            function k2!(a)
                a[1] = 2.0f0
                return nothing
            end

            a = MtlArray{Float32}(undef, 1)

            t1 = @async @metal launch=false k1!(a)
            t2 = @async @metal launch=false k2!(a)

            # Should complete without deadlock (with timeout)
            @test timedwait(() -> istaskdone(t1) && istaskdone(t2), 10.0) == :ok
        end
    end
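
The last testset relies on `timedwait` to turn a potential deadlock into a test failure rather than a hang. The sketch below illustrates that pattern in isolation, with `sleep` calls standing in for the `@metal launch=false` compilations so it runs without a GPU. The task names are illustrative only. Note that `istaskdone` is also true for a task that threw, so a separate `istaskfailed` check is needed to detect errors.

```julia
# Two stand-in tasks; in the PR's test these are kernel compilations.
t1 = @async (sleep(0.1); :first)
t2 = @async (sleep(0.2); :second)

# Poll until both tasks are done, or give up after 5 seconds.
status = timedwait(() -> istaskdone(t1) && istaskdone(t2), 5.0)

# `istaskdone` does not distinguish success from failure, so check both.
both_succeeded = !istaskfailed(t1) && !istaskfailed(t2)

# A task that does not finish within the timeout yields :timed_out instead.
stuck = @async sleep(5)
stuck_status = timedwait(() -> istaskdone(stuck), 0.5)

println("status=", status, " succeeded=", both_succeeded, " stuck=", stuck_status)
```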
`@async` is pinned to the same thread as its parent.
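
To illustrate the reviewer's point, here is a small sketch comparing where the two kinds of task run. It uses only Base Julia and assumes it is run as a top-level script. With a single thread both tasks necessarily share the parent's thread, so start Julia with several threads (for example `julia -t 4`) to see a difference.

```julia
parent_tid = Threads.threadid()

# `@async` tasks are sticky: they run on the thread of the task that created them.
async_tid = fetch(@async Threads.threadid())

# `Threads.@spawn` tasks may be scheduled on any available thread.
spawn_tid = fetch(Threads.@spawn Threads.threadid())

println("parent=", parent_tid, " async=", async_tid, " spawn=", spawn_tid,
        " nthreads=", Threads.nthreads())
```

Because an `@async` warmup shares a thread with the code that loaded the package, it can only make progress when that code yields. This is presumably the motivation for the later commit that switches to `Threads.@spawn`.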