[Do not merge] Switch to GPUArrays.jl accumulate implementation#625

Open
christiangnrd wants to merge 3 commits into main from noaccum

Conversation

@christiangnrd (Member)

Opened to run benchmarks.

Todo:

  • Add a compat bound once the new GPUArrays version is released

@github-actions Bot (Contributor) left a comment


Metal Benchmarks

Benchmark suite Current: 7747983 Previous: 28d2eb3 Ratio
latency/precompile 25220740333 ns 25451816292 ns 0.99
latency/ttfp 2185103083 ns 2366668104.5 ns 0.92
latency/import 1255539541.5 ns 1439498750 ns 0.87
integration/metaldevrt 861687.5 ns 865458 ns 1.00
integration/byval/slices=1 1604917 ns 1573625 ns 1.02
integration/byval/slices=3 20285021.5 ns 8948666.5 ns 2.27
integration/byval/reference 1582459 ns 1548416 ns 1.02
integration/byval/slices=2 2754083 ns 2650958 ns 1.04
kernel/indexing 486458 ns 647875 ns 0.75
kernel/indexing_checked 483271 ns 636375 ns 0.76
kernel/launch 13292 ns 11625 ns 1.14
kernel/rand 546084 ns 578375 ns 0.94
array/construct 6750 ns 6084 ns 1.11
array/broadcast 563917 ns 605917 ns 0.93
array/random/randn/Float32 1034166 ns 1008250 ns 1.03
array/random/randn!/Float32 725875 ns 729458 ns 1.00
array/random/rand!/Int64 542750 ns 547959 ns 0.99
array/random/rand!/Float32 544583 ns 597333 ns 0.91
array/random/rand/Int64 917437.5 ns 743500 ns 1.23
array/random/rand/Float32 832562.5 ns 636959 ns 1.31
array/accumulate/Int64/1d 2426271 ns 1254333 ns 1.93
array/accumulate/Int64/dims=1 2327479.5 ns 1840916.5 ns 1.26
array/accumulate/Int64/dims=2 2542125 ns 2154750 ns 1.18
array/accumulate/Int64/dims=1L 6704167 ns 11672583 ns 0.57
array/accumulate/Int64/dims=2L 18939313 ns 9846791.5 ns 1.92
array/accumulate/Float32/1d 1804625 ns 1104187.5 ns 1.63
array/accumulate/Float32/dims=1 2078459 ns 1543791.5 ns 1.35
array/accumulate/Float32/dims=2 2368062.5 ns 1875562.5 ns 1.26
array/accumulate/Float32/dims=1L 5188646 ns 9790417 ns 0.53
array/accumulate/Float32/dims=2L 15447000 ns 7247208 ns 2.13
array/reductions/reduce/Int64/1d 1308958.5 ns 1573749.5 ns 0.83
array/reductions/reduce/Int64/dims=1 1133541.5 ns 1091271 ns 1.04
array/reductions/reduce/Int64/dims=2 1155750 ns 1133334 ns 1.02
array/reductions/reduce/Int64/dims=1L 2073625 ns 2015499.5 ns 1.03
array/reductions/reduce/Int64/dims=2L 4038479 ns 4237625 ns 0.95
array/reductions/reduce/Float32/1d 756833 ns 991562.5 ns 0.76
array/reductions/reduce/Float32/dims=1 820062 ns 837958 ns 0.98
array/reductions/reduce/Float32/dims=2 840458 ns 865645.5 ns 0.97
array/reductions/reduce/Float32/dims=1L 1369479 ns 1332167 ns 1.03
array/reductions/reduce/Float32/dims=2L 1833979 ns 1802521 ns 1.02
array/reductions/mapreduce/Int64/1d 1328417 ns 1505209 ns 0.88
array/reductions/mapreduce/Int64/dims=1 1143542 ns 1092375 ns 1.05
array/reductions/mapreduce/Int64/dims=2 1182709 ns 1141166.5 ns 1.04
array/reductions/mapreduce/Int64/dims=1L 1985958 ns 2016167 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 3659250 ns 3599334 ns 1.02
array/reductions/mapreduce/Float32/1d 742834 ns 1036333.5 ns 0.72
array/reductions/mapreduce/Float32/dims=1 828521 ns 845584 ns 0.98
array/reductions/mapreduce/Float32/dims=2 861229.5 ns 849958 ns 1.01
array/reductions/mapreduce/Float32/dims=1L 1349125 ns 1305625 ns 1.03
array/reductions/mapreduce/Float32/dims=2L 1839395.5 ns 1811750 ns 1.02
array/private/copyto!/gpu_to_gpu 577208 ns 642208 ns 0.90
array/private/copyto!/cpu_to_gpu 718791.5 ns 796688 ns 0.90
array/private/copyto!/gpu_to_cpu 749979.5 ns 807500 ns 0.93
array/private/iteration/findall/int 1840458.5 ns 1567583 ns 1.17
array/private/iteration/findall/bool 1565500 ns 1400584 ns 1.12
array/private/iteration/findfirst/int 2107667 ns 2097125 ns 1.01
array/private/iteration/findfirst/bool 2064416.5 ns 2047416.5 ns 1.01
array/private/iteration/scalar 2841125 ns 3930375 ns 0.72
array/private/iteration/logical 2770334 ns 2624146.5 ns 1.06
array/private/iteration/findmin/1d 2305792 ns 2514125 ns 0.92
array/private/iteration/findmin/2d 1549854 ns 1793520.5 ns 0.86
array/private/copy 900125 ns 571104.5 ns 1.58
array/shared/copyto!/gpu_to_gpu 85500 ns 85708 ns 1.00
array/shared/copyto!/cpu_to_gpu 84292 ns 81542 ns 1.03
array/shared/copyto!/gpu_to_cpu 83583 ns 82333 ns 1.02
array/shared/iteration/findall/int 1825896 ns 1568292 ns 1.16
array/shared/iteration/findall/bool 1579625 ns 1421958 ns 1.11
array/shared/iteration/findfirst/int 1722000 ns 1664250 ns 1.03
array/shared/iteration/findfirst/bool 1667729 ns 1655375 ns 1.01
array/shared/iteration/scalar 209541.5 ns 201375 ns 1.04
array/shared/iteration/logical 2492708 ns 2429250 ns 1.03
array/shared/iteration/findmin/1d 1899958.5 ns 2128417 ns 0.89
array/shared/iteration/findmin/2d 1561250 ns 1798333.5 ns 0.87
array/shared/copy 216667 ns 248750 ns 0.87
array/permutedims/4d 2477250 ns 2382625 ns 1.04
array/permutedims/2d 1212916.5 ns 1176375 ns 1.03
array/permutedims/3d 1798375 ns 1675000 ns 1.07
metal/synchronization/stream 19167 ns 19000 ns 1.01
metal/synchronization/context 20833 ns 20125 ns 1.04

This comment was automatically generated by a workflow using github-action-benchmark.

@github-actions Bot (Contributor) commented Jul 20, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (`git runic main`) to apply these changes.

Click here to view the suggested changes.
diff --git a/test/runtests.jl b/test/runtests.jl
index 9b6b0c3d..6d16c110 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -11,7 +11,7 @@ if parse(Bool, get(ENV, "BUILDKITE", "false"))
 end
 
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
 
 # Quit without erroring if Metal loaded without issues on unsupported platforms
 if !Sys.isapple()

@christiangnrd (Member, Author) commented Jul 20, 2025

As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.

The performance improvement with column-wise accumulation on 3x1000000 matrices comes from Metal missing an easy optimization (see #626). Edit: I got confused; this optimization is only present for reductions.
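For readers unfamiliar with the benchmark names, the cases being discussed are `accumulate` scans along each dimension of a wide matrix. A minimal sketch of the operation (shown with plain CPU arrays and a small stand-in matrix; the actual suite presumably runs the same calls on `MtlArray` inputs, and the `L` suffix in the benchmark names is assumed to denote the large 3x1000000 variants):

```julia
# Small stand-in for the 3x1000000 matrix from the benchmarks.
A = reshape(1:12, 3, 4)

# dims=1: cumulative sum down each column -- many short scans.
col = accumulate(+, A; dims = 1)

# dims=2: cumulative sum along each row -- few long scans; for the
# 3x1000000 case this is the variant that regressed ~2x.
row = accumulate(+, A; dims = 2)
```

Along `dims = 2` each scan runs the full length of a row, so the 3x1000000 shape yields just three very long scans, which is presumably the stressful case for a generic GPU scan implementation.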

@christiangnrd christiangnrd changed the title Switch to GPUArrays.jl accumulate implementation Switch to GPUArrays.jl accumulate implementation Jul 20, 2025
@codecov Bot commented Jul 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.35%. Comparing base (1942968) to head (b296d15).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
- Coverage   80.63%   80.35%   -0.29%     
==========================================
  Files          61       60       -1     
  Lines        2722     2678      -44     
==========================================
- Hits         2195     2152      -43     
+ Misses        527      526       -1     


@christiangnrd christiangnrd changed the title Switch to GPUArrays.jl accumulate implementation [Do not merge] Switch to GPUArrays.jl accumulate implementation Jul 23, 2025
@maleadt (Member) commented Jul 29, 2025

> As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.

I don't see a massive slowdown?

@christiangnrd (Member, Author) commented

@maleadt The accumulate dims=2L benchmarks show a 2x slowdown. Did I get my rows/columns mixed up in my comment?

@maleadt (Member) commented Jul 30, 2025

Oh OK, I didn't consider 2x a "massive slowdown" :-) Still something to look at, of course, but much less dramatic than the 7x regressions we saw, e.g., against CUDA.jl's reduction.
