[Do not merge] Switch to GPUArrays.jl accumulate implementation #625
christiangnrd wants to merge 3 commits into main
Metal Benchmarks
| Benchmark suite | Current: 7747983 | Previous: 28d2eb3 | Ratio |
|---|---|---|---|
| latency/precompile | 25220740333 ns | 25451816292 ns | 0.99 |
| latency/ttfp | 2185103083 ns | 2366668104.5 ns | 0.92 |
| latency/import | 1255539541.5 ns | 1439498750 ns | 0.87 |
| integration/metaldevrt | 861687.5 ns | 865458 ns | 1.00 |
| integration/byval/slices=1 | 1604917 ns | 1573625 ns | 1.02 |
| integration/byval/slices=3 | 20285021.5 ns | 8948666.5 ns | 2.27 |
| integration/byval/reference | 1582459 ns | 1548416 ns | 1.02 |
| integration/byval/slices=2 | 2754083 ns | 2650958 ns | 1.04 |
| kernel/indexing | 486458 ns | 647875 ns | 0.75 |
| kernel/indexing_checked | 483271 ns | 636375 ns | 0.76 |
| kernel/launch | 13292 ns | 11625 ns | 1.14 |
| kernel/rand | 546084 ns | 578375 ns | 0.94 |
| array/construct | 6750 ns | 6084 ns | 1.11 |
| array/broadcast | 563917 ns | 605917 ns | 0.93 |
| array/random/randn/Float32 | 1034166 ns | 1008250 ns | 1.03 |
| array/random/randn!/Float32 | 725875 ns | 729458 ns | 1.00 |
| array/random/rand!/Int64 | 542750 ns | 547959 ns | 0.99 |
| array/random/rand!/Float32 | 544583 ns | 597333 ns | 0.91 |
| array/random/rand/Int64 | 917437.5 ns | 743500 ns | 1.23 |
| array/random/rand/Float32 | 832562.5 ns | 636959 ns | 1.31 |
| array/accumulate/Int64/1d | 2426271 ns | 1254333 ns | 1.93 |
| array/accumulate/Int64/dims=1 | 2327479.5 ns | 1840916.5 ns | 1.26 |
| array/accumulate/Int64/dims=2 | 2542125 ns | 2154750 ns | 1.18 |
| array/accumulate/Int64/dims=1L | 6704167 ns | 11672583 ns | 0.57 |
| array/accumulate/Int64/dims=2L | 18939313 ns | 9846791.5 ns | 1.92 |
| array/accumulate/Float32/1d | 1804625 ns | 1104187.5 ns | 1.63 |
| array/accumulate/Float32/dims=1 | 2078459 ns | 1543791.5 ns | 1.35 |
| array/accumulate/Float32/dims=2 | 2368062.5 ns | 1875562.5 ns | 1.26 |
| array/accumulate/Float32/dims=1L | 5188646 ns | 9790417 ns | 0.53 |
| array/accumulate/Float32/dims=2L | 15447000 ns | 7247208 ns | 2.13 |
| array/reductions/reduce/Int64/1d | 1308958.5 ns | 1573749.5 ns | 0.83 |
| array/reductions/reduce/Int64/dims=1 | 1133541.5 ns | 1091271 ns | 1.04 |
| array/reductions/reduce/Int64/dims=2 | 1155750 ns | 1133334 ns | 1.02 |
| array/reductions/reduce/Int64/dims=1L | 2073625 ns | 2015499.5 ns | 1.03 |
| array/reductions/reduce/Int64/dims=2L | 4038479 ns | 4237625 ns | 0.95 |
| array/reductions/reduce/Float32/1d | 756833 ns | 991562.5 ns | 0.76 |
| array/reductions/reduce/Float32/dims=1 | 820062 ns | 837958 ns | 0.98 |
| array/reductions/reduce/Float32/dims=2 | 840458 ns | 865645.5 ns | 0.97 |
| array/reductions/reduce/Float32/dims=1L | 1369479 ns | 1332167 ns | 1.03 |
| array/reductions/reduce/Float32/dims=2L | 1833979 ns | 1802521 ns | 1.02 |
| array/reductions/mapreduce/Int64/1d | 1328417 ns | 1505209 ns | 0.88 |
| array/reductions/mapreduce/Int64/dims=1 | 1143542 ns | 1092375 ns | 1.05 |
| array/reductions/mapreduce/Int64/dims=2 | 1182709 ns | 1141166.5 ns | 1.04 |
| array/reductions/mapreduce/Int64/dims=1L | 1985958 ns | 2016167 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 3659250 ns | 3599334 ns | 1.02 |
| array/reductions/mapreduce/Float32/1d | 742834 ns | 1036333.5 ns | 0.72 |
| array/reductions/mapreduce/Float32/dims=1 | 828521 ns | 845584 ns | 0.98 |
| array/reductions/mapreduce/Float32/dims=2 | 861229.5 ns | 849958 ns | 1.01 |
| array/reductions/mapreduce/Float32/dims=1L | 1349125 ns | 1305625 ns | 1.03 |
| array/reductions/mapreduce/Float32/dims=2L | 1839395.5 ns | 1811750 ns | 1.02 |
| array/private/copyto!/gpu_to_gpu | 577208 ns | 642208 ns | 0.90 |
| array/private/copyto!/cpu_to_gpu | 718791.5 ns | 796688 ns | 0.90 |
| array/private/copyto!/gpu_to_cpu | 749979.5 ns | 807500 ns | 0.93 |
| array/private/iteration/findall/int | 1840458.5 ns | 1567583 ns | 1.17 |
| array/private/iteration/findall/bool | 1565500 ns | 1400584 ns | 1.12 |
| array/private/iteration/findfirst/int | 2107667 ns | 2097125 ns | 1.01 |
| array/private/iteration/findfirst/bool | 2064416.5 ns | 2047416.5 ns | 1.01 |
| array/private/iteration/scalar | 2841125 ns | 3930375 ns | 0.72 |
| array/private/iteration/logical | 2770334 ns | 2624146.5 ns | 1.06 |
| array/private/iteration/findmin/1d | 2305792 ns | 2514125 ns | 0.92 |
| array/private/iteration/findmin/2d | 1549854 ns | 1793520.5 ns | 0.86 |
| array/private/copy | 900125 ns | 571104.5 ns | 1.58 |
| array/shared/copyto!/gpu_to_gpu | 85500 ns | 85708 ns | 1.00 |
| array/shared/copyto!/cpu_to_gpu | 84292 ns | 81542 ns | 1.03 |
| array/shared/copyto!/gpu_to_cpu | 83583 ns | 82333 ns | 1.02 |
| array/shared/iteration/findall/int | 1825896 ns | 1568292 ns | 1.16 |
| array/shared/iteration/findall/bool | 1579625 ns | 1421958 ns | 1.11 |
| array/shared/iteration/findfirst/int | 1722000 ns | 1664250 ns | 1.03 |
| array/shared/iteration/findfirst/bool | 1667729 ns | 1655375 ns | 1.01 |
| array/shared/iteration/scalar | 209541.5 ns | 201375 ns | 1.04 |
| array/shared/iteration/logical | 2492708 ns | 2429250 ns | 1.03 |
| array/shared/iteration/findmin/1d | 1899958.5 ns | 2128417 ns | 0.89 |
| array/shared/iteration/findmin/2d | 1561250 ns | 1798333.5 ns | 0.87 |
| array/shared/copy | 216667 ns | 248750 ns | 0.87 |
| array/permutedims/4d | 2477250 ns | 2382625 ns | 1.04 |
| array/permutedims/2d | 1212916.5 ns | 1176375 ns | 1.03 |
| array/permutedims/3d | 1798375 ns | 1675000 ns | 1.07 |
| metal/synchronization/stream | 19167 ns | 19000 ns | 1.01 |
| metal/synchronization/context | 20833 ns | 20125 ns | 1.04 |

This comment was automatically generated by workflow using github-action-benchmark.

Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

diff --git a/test/runtests.jl b/test/runtests.jl
index 9b6b0c3d..6d16c110 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -11,7 +11,7 @@ if parse(Bool, get(ENV, "BUILDKITE", "false"))
 end
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
 # Quit without erroring if Metal loaded without issues on unsupported platforms
 if !Sys.isapple()

As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.
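
For readers unfamiliar with the shape being discussed, here is a minimal CPU-only sketch of what accumulating along each dimension of a 3x1000000 matrix means. It uses Base Julia's `accumulate` on a plain `Array` rather than a Metal array, and the variable names are hypothetical. It does not reproduce the benchmark code, and which table entry corresponds to this shape is not stated in this thread.

```julia
# Illustrative sketch only: plain CPU arrays, not the Metal.jl benchmark.
rows, cols = 3, 1_000_000          # shape taken from the comment above
A = ones(Float32, rows, cols)

# dims=1 scans down each column:
# 1_000_000 independent scans, each of length 3.
down = accumulate(+, A; dims=1)

# dims=2 scans along each row:
# only 3 independent scans, each of length 1_000_000.
along = accumulate(+, A; dims=2)

# The last element of each scan holds the total of that column or row.
println(down[end, 1])    # total of the first column
println(along[1, end])   # total of the first row
```

The second case offers very little parallelism across scans, so it depends heavily on how well a single long scan is parallelized, which is one plausible reason (an assumption, not something measured here) why such a shape is sensitive to the choice of implementation.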

Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
- Coverage   80.63%   80.35%   -0.29%
==========================================
  Files          61       60       -1
  Lines        2722     2678      -44
==========================================
- Hits         2195     2152      -43
+ Misses        527      526       -1
☔ View full report in Codecov by Sentry.

I don't see a massive slowdown?

@maleadt The accumulate

Oh OK, I didn't consider 2x a "massive slowdown" :-) Still something to look at of course, but much less dramatic than the 7x regressions we e.g. saw against CUDA.jl's reduction.

b296d15 to 84f519a
Opened to run benchmarks.
Todo: