[Do not merge] Switch to GPUArrays.jl reduction implementation #628
christiangnrd wants to merge 1 commit into main
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:
diff --git a/perf/runbenchmarks.jl b/perf/runbenchmarks.jl
index ba5e0d40..1d7901c5 100644
--- a/perf/runbenchmarks.jl
+++ b/perf/runbenchmarks.jl
@@ -1,6 +1,6 @@
# benchmark suite execution and codespeed submission
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
using Metal
diff --git a/test/runtests.jl b/test/runtests.jl
index 4ee51134..fb376e4f 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -6,7 +6,7 @@ import REPL
using Test
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
# Quit without erroring if Metal loaded without issues on unsupported platforms
if !Sys.isapple()

Leaving the current
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #628 +/- ##
=======================================
Coverage 80.63% 80.63%
=======================================
Files 61 61
Lines 2722 2722
=======================================
Hits 2195 2195
Misses 527 527
Metal Benchmarks
| Benchmark suite | Current: 2e5239f | Previous: 28d2eb3 | Ratio |
|---|---|---|---|
| latency/precompile | 24914872583 ns | 25451816292 ns | 0.98 |
| latency/ttfp | 2147869250 ns | 2366668104.5 ns | 0.91 |
| latency/import | 1234056917 ns | 1439498750 ns | 0.86 |
| integration/metaldevrt | 874458 ns | 865458 ns | 1.01 |
| integration/byval/slices=1 | 1570291 ns | 1573625 ns | 1.00 |
| integration/byval/slices=3 | 8384667 ns | 8948666.5 ns | 0.94 |
| integration/byval/reference | 1561833 ns | 1548416 ns | 1.01 |
| integration/byval/slices=2 | 2637792 ns | 2650958 ns | 1.00 |
| kernel/indexing | 640166.5 ns | 647875 ns | 0.99 |
| kernel/indexing_checked | 616104.5 ns | 636375 ns | 0.97 |
| kernel/launch | 12875 ns | 11625 ns | 1.11 |
| kernel/rand | 579291 ns | 578375 ns | 1.00 |
| array/construct | 6625 ns | 6084 ns | 1.09 |
| array/broadcast | 611062.5 ns | 605917 ns | 1.01 |
| array/random/randn/Float32 | 1019604 ns | 1008250 ns | 1.01 |
| array/random/randn!/Float32 | 757875 ns | 729458 ns | 1.04 |
| array/random/rand!/Int64 | 560125 ns | 547959 ns | 1.02 |
| array/random/rand!/Float32 | 589958 ns | 597333 ns | 0.99 |
| array/random/rand/Int64 | 763500 ns | 743500 ns | 1.03 |
| array/random/rand/Float32 | 620625 ns | 636959 ns | 0.97 |
| array/accumulate/Int64/1d | 1241417 ns | 1254333 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 1855917 ns | 1840916.5 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 2192625 ns | 2154750 ns | 1.02 |
| array/accumulate/Int64/dims=1L | 11493750 ns | 11672583 ns | 0.98 |
| array/accumulate/Int64/dims=2L | 9823583.5 ns | 9846791.5 ns | 1.00 |
| array/accumulate/Float32/1d | 1132812.5 ns | 1104187.5 ns | 1.03 |
| array/accumulate/Float32/dims=1 | 1565292 ns | 1543791.5 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 1878958.5 ns | 1875562.5 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 9760312.5 ns | 9790417 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 7244917 ns | 7247208 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 577834 ns | 1573749.5 ns | 0.37 |
| array/reductions/reduce/Int64/dims=1 | 1044708 ns | 1091271 ns | 0.96 |
| array/reductions/reduce/Int64/dims=2 | 1046417 ns | 1133334 ns | 0.92 |
| array/reductions/reduce/Int64/dims=1L | 2495396 ns | 2015499.5 ns | 1.24 |
| array/reductions/reduce/Int64/dims=2L | 2872562.5 ns | 4237625 ns | 0.68 |
| array/reductions/reduce/Float32/1d | 957791.5 ns | 991562.5 ns | 0.97 |
| array/reductions/reduce/Float32/dims=1 | 1043042 ns | 837958 ns | 1.24 |
| array/reductions/reduce/Float32/dims=2 | 1069833 ns | 865645.5 ns | 1.24 |
| array/reductions/reduce/Float32/dims=1L | 1796062.5 ns | 1332167 ns | 1.35 |
| array/reductions/reduce/Float32/dims=2L | 2859416 ns | 1802521 ns | 1.59 |
| array/reductions/mapreduce/Int64/1d | 613333 ns | 1505209 ns | 0.41 |
| array/reductions/mapreduce/Int64/dims=1 | 1040271 ns | 1092375 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=2 | 1079208 ns | 1141166.5 ns | 0.95 |
| array/reductions/mapreduce/Int64/dims=1L | 2444604.5 ns | 2016167 ns | 1.21 |
| array/reductions/mapreduce/Int64/dims=2L | 2875541 ns | 3599334 ns | 0.80 |
| array/reductions/mapreduce/Float32/1d | 1037666.5 ns | 1036333.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1 | 1049625 ns | 845584 ns | 1.24 |
| array/reductions/mapreduce/Float32/dims=2 | 1060542 ns | 849958 ns | 1.25 |
| array/reductions/mapreduce/Float32/dims=1L | 1797542 ns | 1305625 ns | 1.38 |
| array/reductions/mapreduce/Float32/dims=2L | 2863021 ns | 1811750 ns | 1.58 |
| array/private/copyto!/gpu_to_gpu | 634792 ns | 642208 ns | 0.99 |
| array/private/copyto!/cpu_to_gpu | 803750 ns | 796688 ns | 1.01 |
| array/private/copyto!/gpu_to_cpu | 801250 ns | 807500 ns | 0.99 |
| array/private/iteration/findall/int | 1585750 ns | 1567583 ns | 1.01 |
| array/private/iteration/findall/bool | 1407792 ns | 1400584 ns | 1.01 |
| array/private/iteration/findfirst/int | 1913479.5 ns | 2097125 ns | 0.91 |
| array/private/iteration/findfirst/bool | 1549167 ns | 2047416.5 ns | 0.76 |
| array/private/iteration/scalar | 4096792 ns | 3930375 ns | 1.04 |
| array/private/iteration/logical | 2912313 ns | 2624146.5 ns | 1.11 |
| array/private/iteration/findmin/1d | 1898958 ns | 2514125 ns | 0.76 |
| array/private/iteration/findmin/2d | 1874854.5 ns | 1793520.5 ns | 1.05 |
| array/private/copy | 617417 ns | 571104.5 ns | 1.08 |
| array/shared/copyto!/gpu_to_gpu | 86834 ns | 85708 ns | 1.01 |
| array/shared/copyto!/cpu_to_gpu | 85292 ns | 81542 ns | 1.05 |
| array/shared/copyto!/gpu_to_cpu | 88541 ns | 82333 ns | 1.08 |
| array/shared/iteration/findall/int | 1583709 ns | 1568292 ns | 1.01 |
| array/shared/iteration/findall/bool | 1443500 ns | 1421958 ns | 1.02 |
| array/shared/iteration/findfirst/int | 1627354 ns | 1664250 ns | 0.98 |
| array/shared/iteration/findfirst/bool | 1424125 ns | 1655375 ns | 0.86 |
| array/shared/iteration/scalar | 204708 ns | 201375 ns | 1.02 |
| array/shared/iteration/logical | 2863250 ns | 2429250 ns | 1.18 |
| array/shared/iteration/findmin/1d | 1709291 ns | 2128417 ns | 0.80 |
| array/shared/iteration/findmin/2d | 1950500 ns | 1798333.5 ns | 1.08 |
| array/shared/copy | 250458 ns | 248750 ns | 1.01 |
| array/permutedims/4d | 2395167 ns | 2382625 ns | 1.01 |
| array/permutedims/2d | 1043250 ns | 1176375 ns | 0.89 |
| array/permutedims/3d | 1689062 ns | 1675000 ns | 1.01 |
| metal/synchronization/stream | 19792 ns | 19000 ns | 1.04 |
| metal/synchronization/context | 20167 ns | 20125 ns | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.
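
For reading the table: the Ratio column is consistent with current time divided by previous time, rounded to two decimals, so values below 1 are speedups and values above 1 are slowdowns. The sketch below recomputes a few ratios from timings copied from the table and sorts them into regressions and improvements. The `threshold` of 1.2 and the helper name `ratio` are assumptions made for illustration, not settings taken from the benchmark workflow.

```julia
# Timings in nanoseconds copied from the benchmark table: (current, previous).
timings = [
    "array/reductions/reduce/Int64/1d"        => (577834.0, 1573749.5),
    "array/reductions/reduce/Float32/dims=2L" => (2859416.0, 1802521.0),
    "array/reductions/mapreduce/Int64/1d"     => (613333.0, 1505209.0),
    "kernel/launch"                           => (12875.0, 11625.0),
]

# Ratio as it appears in the table: current / previous, two decimals.
ratio(current, previous) = round(current / previous; digits = 2)

# Hypothetical cut-off, chosen only for this example.
threshold = 1.2

ratios = [name => ratio(t...) for (name, t) in timings]

# Slower than the previous commit by more than the threshold.
regressions = [name for (name, r) in ratios if r > threshold]

# Faster than the previous commit by more than the threshold.
improvements = [name for (name, r) in ratios if r < 1 / threshold]

for (name, r) in ratios
    println(rpad(name, 45), r)
end
```

With a symmetric cut-off like this, small shifts such as `kernel/launch` fall into neither list, which keeps attention on the reduction benchmarks this PR actually changes.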
I think I'd rather we do it in one pass, because the change needs to be made across back-ends.

In any case, despite some regressions, the overall performance seems better here than over in CUDA.jl.
Don't remove the file yet, to avoid a merge conflict with #627