[Do not merge] Switch to GPUArrays.jl reduction implementation#628

Open
christiangnrd wants to merge 1 commit into main from noreduce

@christiangnrd
Member

Don't remove the file yet, to avoid a merge conflict with #627.

@github-actions
Contributor

github-actions Bot commented Jul 20, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Suggested changes:
diff --git a/perf/runbenchmarks.jl b/perf/runbenchmarks.jl
index ba5e0d40..1d7901c5 100644
--- a/perf/runbenchmarks.jl
+++ b/perf/runbenchmarks.jl
@@ -1,6 +1,6 @@
 # benchmark suite execution and codespeed submission
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
 
 using Metal
 
diff --git a/test/runtests.jl b/test/runtests.jl
index 4ee51134..fb376e4f 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -6,7 +6,7 @@ import REPL
 using Test
 
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
 
 # Quit without erroring if Metal loaded without issues on unsupported platforms
 if !Sys.isapple()

@christiangnrd
Member Author

If we leave the current mapreducedim! implementation in place, we can transition in two parts: first switch once AK supports broadcasted reductions, then remove the implementations from this repo after AK supports >1 input dims.
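
As a rough illustration of the two kinds of reduction mentioned above, the sketch below uses only Base Julia on ordinary CPU arrays. It is not the Metal.jl, GPUArrays.jl, or AK code path, and reading "broadcasted reductions" as reductions over a lazy `Broadcast.Broadcasted` object is an assumption made for this example. The arrays `A` and `B` are hypothetical.

```julia
# Hypothetical inputs; plain CPU arrays stand in for GPU arrays here.
A = Float32[1 2; 3 4]
B = Float32[10 20; 30 40]

# Multi-input mapreduce: apply * elementwise across A and B, then reduce with +.
multi_input = mapreduce(*, +, A, B)

# The same reduction over a lazy broadcast object, which avoids
# materializing the intermediate array A .* B.
bc = Broadcast.instantiate(Broadcast.broadcasted(*, A, B))
lazy = sum(bc)

println(multi_input, " ", lazy)
```

Both forms compute the same sum of elementwise products; the second hands the reduction a single lazy object instead of several input arrays.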

@codecov

codecov Bot commented Jul 21, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.63%. Comparing base (1942968) to head (c0eddd1).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #628   +/-   ##
=======================================
  Coverage   80.63%   80.63%           
=======================================
  Files          61       61           
  Lines        2722     2722           
=======================================
  Hits         2195     2195           
  Misses        527      527           

Contributor

@github-actions github-actions Bot left a comment

Metal Benchmarks

Details
| Benchmark suite | Current: 2e5239f | Previous: 28d2eb3 | Ratio |
|---|---|---|---|
| latency/precompile | 24914872583 ns | 25451816292 ns | 0.98 |
| latency/ttfp | 2147869250 ns | 2366668104.5 ns | 0.91 |
| latency/import | 1234056917 ns | 1439498750 ns | 0.86 |
| integration/metaldevrt | 874458 ns | 865458 ns | 1.01 |
| integration/byval/slices=1 | 1570291 ns | 1573625 ns | 1.00 |
| integration/byval/slices=3 | 8384667 ns | 8948666.5 ns | 0.94 |
| integration/byval/reference | 1561833 ns | 1548416 ns | 1.01 |
| integration/byval/slices=2 | 2637792 ns | 2650958 ns | 1.00 |
| kernel/indexing | 640166.5 ns | 647875 ns | 0.99 |
| kernel/indexing_checked | 616104.5 ns | 636375 ns | 0.97 |
| kernel/launch | 12875 ns | 11625 ns | 1.11 |
| kernel/rand | 579291 ns | 578375 ns | 1.00 |
| array/construct | 6625 ns | 6084 ns | 1.09 |
| array/broadcast | 611062.5 ns | 605917 ns | 1.01 |
| array/random/randn/Float32 | 1019604 ns | 1008250 ns | 1.01 |
| array/random/randn!/Float32 | 757875 ns | 729458 ns | 1.04 |
| array/random/rand!/Int64 | 560125 ns | 547959 ns | 1.02 |
| array/random/rand!/Float32 | 589958 ns | 597333 ns | 0.99 |
| array/random/rand/Int64 | 763500 ns | 743500 ns | 1.03 |
| array/random/rand/Float32 | 620625 ns | 636959 ns | 0.97 |
| array/accumulate/Int64/1d | 1241417 ns | 1254333 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 1855917 ns | 1840916.5 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 2192625 ns | 2154750 ns | 1.02 |
| array/accumulate/Int64/dims=1L | 11493750 ns | 11672583 ns | 0.98 |
| array/accumulate/Int64/dims=2L | 9823583.5 ns | 9846791.5 ns | 1.00 |
| array/accumulate/Float32/1d | 1132812.5 ns | 1104187.5 ns | 1.03 |
| array/accumulate/Float32/dims=1 | 1565292 ns | 1543791.5 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 1878958.5 ns | 1875562.5 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 9760312.5 ns | 9790417 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 7244917 ns | 7247208 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 577834 ns | 1573749.5 ns | 0.37 |
| array/reductions/reduce/Int64/dims=1 | 1044708 ns | 1091271 ns | 0.96 |
| array/reductions/reduce/Int64/dims=2 | 1046417 ns | 1133334 ns | 0.92 |
| array/reductions/reduce/Int64/dims=1L | 2495396 ns | 2015499.5 ns | 1.24 |
| array/reductions/reduce/Int64/dims=2L | 2872562.5 ns | 4237625 ns | 0.68 |
| array/reductions/reduce/Float32/1d | 957791.5 ns | 991562.5 ns | 0.97 |
| array/reductions/reduce/Float32/dims=1 | 1043042 ns | 837958 ns | 1.24 |
| array/reductions/reduce/Float32/dims=2 | 1069833 ns | 865645.5 ns | 1.24 |
| array/reductions/reduce/Float32/dims=1L | 1796062.5 ns | 1332167 ns | 1.35 |
| array/reductions/reduce/Float32/dims=2L | 2859416 ns | 1802521 ns | 1.59 |
| array/reductions/mapreduce/Int64/1d | 613333 ns | 1505209 ns | 0.41 |
| array/reductions/mapreduce/Int64/dims=1 | 1040271 ns | 1092375 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=2 | 1079208 ns | 1141166.5 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=1L | 2444604.5 ns | 2016167 ns | 1.21 |
| array/reductions/mapreduce/Int64/dims=2L | 2875541 ns | 3599334 ns | 0.80 |
| array/reductions/mapreduce/Float32/1d | 1037666.5 ns | 1036333.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 1049625 ns | 845584 ns | 1.24 |
| array/reductions/mapreduce/Float32/dims=2 | 1060542 ns | 849958 ns | 1.25 |
| array/reductions/mapreduce/Float32/dims=1L | 1797542 ns | 1305625 ns | 1.38 |
| array/reductions/mapreduce/Float32/dims=2L | 2863021 ns | 1811750 ns | 1.58 |
| array/private/copyto!/gpu_to_gpu | 634792 ns | 642208 ns | 0.99 |
| array/private/copyto!/cpu_to_gpu | 803750 ns | 796688 ns | 1.01 |
| array/private/copyto!/gpu_to_cpu | 801250 ns | 807500 ns | 0.99 |
| array/private/iteration/findall/int | 1585750 ns | 1567583 ns | 1.01 |
| array/private/iteration/findall/bool | 1407792 ns | 1400584 ns | 1.01 |
| array/private/iteration/findfirst/int | 1913479.5 ns | 2097125 ns | 0.91 |
| array/private/iteration/findfirst/bool | 1549167 ns | 2047416.5 ns | 0.76 |
| array/private/iteration/scalar | 4096792 ns | 3930375 ns | 1.04 |
| array/private/iteration/logical | 2912313 ns | 2624146.5 ns | 1.11 |
| array/private/iteration/findmin/1d | 1898958 ns | 2514125 ns | 0.76 |
| array/private/iteration/findmin/2d | 1874854.5 ns | 1793520.5 ns | 1.05 |
| array/private/copy | 617417 ns | 571104.5 ns | 1.08 |
| array/shared/copyto!/gpu_to_gpu | 86834 ns | 85708 ns | 1.01 |
| array/shared/copyto!/cpu_to_gpu | 85292 ns | 81542 ns | 1.05 |
| array/shared/copyto!/gpu_to_cpu | 88541 ns | 82333 ns | 1.08 |
| array/shared/iteration/findall/int | 1583709 ns | 1568292 ns | 1.01 |
| array/shared/iteration/findall/bool | 1443500 ns | 1421958 ns | 1.02 |
| array/shared/iteration/findfirst/int | 1627354 ns | 1664250 ns | 0.98 |
| array/shared/iteration/findfirst/bool | 1424125 ns | 1655375 ns | 0.86 |
| array/shared/iteration/scalar | 204708 ns | 201375 ns | 1.02 |
| array/shared/iteration/logical | 2863250 ns | 2429250 ns | 1.18 |
| array/shared/iteration/findmin/1d | 1709291 ns | 2128417 ns | 0.80 |
| array/shared/iteration/findmin/2d | 1950500 ns | 1798333.5 ns | 1.08 |
| array/shared/copy | 250458 ns | 248750 ns | 1.01 |
| array/permutedims/4d | 2395167 ns | 2382625 ns | 1.01 |
| array/permutedims/2d | 1043250 ns | 1176375 ns | 0.89 |
| array/permutedims/3d | 1689062 ns | 1675000 ns | 1.01 |
| metal/synchronization/stream | 19792 ns | 19000 ns | 1.04 |
| metal/synchronization/context | 20167 ns | 20125 ns | 1.00 |
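
The Ratio column above appears to be the current time divided by the previous time, rounded to two decimals, so values below 1 would indicate a speedup and values above 1 a regression. A minimal sketch of that arithmetic, assuming this interpretation and using two rows copied from the table (the helper `ratio` is a hypothetical name, not part of the benchmark tooling):

```julia
# Assumed definition: ratio = current / previous, both in nanoseconds.
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits = 2)

# (name, current ns, previous ns), copied from the benchmark table.
rows = [
    ("array/reductions/reduce/Int64/1d", 577834.0, 1573749.5),
    ("array/reductions/reduce/Float32/dims=2L", 2859416.0, 1802521.0),
]

for (name, current_ns, previous_ns) in rows
    r = ratio(current_ns, previous_ns)
    verdict = r < 1 ? "faster" : r > 1 ? "slower" : "unchanged"
    println(name, ": ", r, " (", verdict, ")")
end
```

Under this reading, the 1d Int64 reduction takes roughly a third of its previous time, while the Float32 `dims=2L` reduction takes about 1.6 times as long.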

This comment was automatically generated by workflow using github-action-benchmark.

@christiangnrd christiangnrd changed the title Switch to GPUArrays.jl reduction implementation [Do not merge] Switch to GPUArrays.jl reduction implementation Jul 23, 2025
@maleadt
Member

maleadt commented Jul 29, 2025

> If we leave the current mapreducedim! implementation in place, we can transition in two parts: first switch once AK supports broadcasted reductions, then remove the implementations from this repo after AK supports >1 input dims.

I think I'd rather we do it in one pass, because the change needs to be made across back-ends.

@maleadt
Member

maleadt commented Jul 30, 2025

In any case, despite some regressions the overall performance seems better here than over in CUDA.jl.

2 participants