3 changes: 3 additions & 0 deletions Project.toml
@@ -37,14 +37,17 @@ WriteVTK = "64499a7a-5c06-52f2-abe2-ccb03c286192"
[weakdeps]
OrdinaryDiffEq = "1dea7af3-3e70-54e6-95c3-0bf5283fa5ed"
OrdinaryDiffEqCore = "bbf590c4-e513-4bbe-9b18-05decba2e5d8"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
TrixiParticlesOrdinaryDiffEqExt = ["OrdinaryDiffEq", "OrdinaryDiffEqCore"]
TrixiParticlesCUDAExt = "CUDA"

[compat]
Accessors = "0.1.43"
Adapt = "4"
CSV = "0.10"
CUDA = "5.9.1"
DataFrames = "1.6"
DelimitedFiles = "1"
DiffEqCallbacks = "4"
42 changes: 42 additions & 0 deletions docs/src/development.md
@@ -66,3 +66,45 @@ To create a new release for TrixiParticles.jl, perform the following steps:
version should be `v0.3.1-dev`. If you just released `v0.2.4`, the new development
version should be `v0.2.5-dev`.

## [Writing GPU-compatible code](@id writing_gpu_code)

When implementing new functionality that should run on both CPUs and GPUs,
follow these rules:

1. Data structures must be generic and parametric.
Do not hardcode concrete CPU array types like `Vector` or `Matrix` in fields.
Use type parameters, so the same structure can store CPU arrays and GPU arrays.
2. Add an Adapt.jl rule in `src/general/gpu.jl`.
Register the new type with `Adapt.@adapt_structure ...`, so `adapt` can recursively
convert all arrays inside the structure to GPU arrays.
This conversion is then applied automatically inside `semidiscretize`.
3. Use `@threaded` for all loops.
Accessing GPU arrays inside regular loops is not allowed.
With a GPU backend, `@threaded` loops are compiled to GPU kernels.
4. Write type-stable code and do not allocate inside `@threaded` loops.
This is required for GPU kernels and is also essential for fast multithreaded CPU code.
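
The four rules above can be sketched together as follows. This is a hypothetical example, not actual TrixiParticles.jl code: the names `MyQuantityCache` and `scale!` are made up, and the exact `@threaded` invocation should be checked against existing loops in the source.
```julia
using Adapt: Adapt

# Rule 1: generic, parametric fields. `A` can be a `Vector` on the CPU
# or a GPU array type, without changing the struct definition.
struct MyQuantityCache{ELTYPE, A <: AbstractArray{ELTYPE}}
    mass    :: A
    density :: A
end

# Rule 2: register the type, so `adapt` (applied inside `semidiscretize`)
# recursively converts `mass` and `density` to GPU arrays.
Adapt.@adapt_structure MyQuantityCache

# Rules 3 and 4: a `@threaded` loop with a type-stable, allocation-free body.
# With a GPU backend, this loop is compiled to a GPU kernel.
function scale!(cache::MyQuantityCache, semi, factor)
    @threaded semi for particle in eachindex(cache.mass)
        cache.mass[particle] *= factor
    end
    return cache
end
```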

## [Writing fast GPU code](@id writing_fast_gpu_code)

The following rules improve kernel performance and avoid common GPU pitfalls:

1. Avoid exceptions and bounds errors inside kernels.
Perform all required checks before entering `@threaded` loops (that is, before GPU kernels).
Then use `@inbounds` directly at the loop where bounds are guaranteed.
In TrixiParticles.jl, we do not place `@inbounds` inside inner helper functions.
Instead, mark helper functions with `@propagate_inbounds` so the loop-level
`@inbounds` is propagated.
2. Avoid implicit `Float64` literals in arithmetic.
For example, prefer `x / 2` over `0.5 * x` so `Float32` simulations stay in `Float32`.
Verify this with `@device_code`, or by confirming the kernel runs on an Apple GPU
(most Apple GPUs do not support `Float64`).
3. Use `div_fast` in performance-critical divisions, but only after benchmarking (!).
It can significantly speed up kernels, but should not be applied indiscriminately.
When introducing `div_fast` in code, add a reference to [this section](@ref writing_fast_gpu_code)
to document the rationale and benchmarking context, e.g., like so:
```julia
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
result = div_fast(dividend, divisor)
```
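
Rules 1 and 2 can be sketched as follows. The names `average_density` and `interact_demo!` are illustrative, not actual TrixiParticles.jl functions.
```julia
# Rule 1: no `@inbounds` inside the helper itself; `@propagate_inbounds`
# forwards the loop-level `@inbounds` into the array accesses here.
Base.@propagate_inbounds function average_density(rho, particle, neighbor)
    # Rule 2: dividing by the integer literal `2` keeps `Float32` inputs in
    # `Float32`, whereas `0.5 * (...)` would promote the result to `Float64`.
    return (rho[particle] + rho[neighbor]) / 2
end

function interact_demo!(dv, rho, pairs, semi)
    # Perform all required checks here, before the loop (before the GPU kernel) ...
    @assert axes(dv) == axes(rho)

    @threaded semi for i in eachindex(pairs)
        particle, neighbor = pairs[i]
        # ... then use `@inbounds` at the loop level only.
        @inbounds dv[particle] += average_density(rho, particle, neighbor)
    end
    return dv
end
```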
5 changes: 5 additions & 0 deletions docs/src/gpu.md
@@ -137,3 +137,8 @@ On GPUs that do not support `Float64`, such as most Apple GPUs, we also need to
the coordinates to `Float32` by passing `coordinates_eltype=Float32` to
the setup functions that create [`InitialCondition`](@ref)s, such as
[`RectangularTank`](@ref), [`RectangularShape`](@ref), and [`SphereShape`](@ref).

## Writing GPU-compatible code

Please see the [development documentation](@ref writing_gpu_code) for guidelines on
how to write GPU-compatible code.
34 changes: 34 additions & 0 deletions ext/TrixiParticlesCUDAExt.jl
@@ -0,0 +1,34 @@
# TODO this might be integrated into CUDA.jl at some point, see
# https://github.com/JuliaGPU/CUDA.jl/pull/3077
module TrixiParticlesCUDAExt

using CUDA: CUDA
using TrixiParticles: TrixiParticles

# Use a faster version of `div_fast` for `Float64` on CUDA.
# By default, `div_fast` translates to `Base.FastMath.div_fast`, but there is
# no fast division for `Float64` on CUDA, so we need to redefine it here to use the
# improved fast reciprocal defined below.
CUDA.@device_override TrixiParticles.div_fast(x, y::Float64) = x * fast_inv_cuda(y)

# Improved fast reciprocal for `Float64` by @Mikolaj-A-Kowalski, which is significantly
# more accurate than just calling "llvm.nvvm.rcp.approx.ftz.d" without the cubic iteration,
# while still being much faster than a full division.
# This is copied from Oceananigans.jl, see https://github.com/CliMA/Oceananigans.jl/pull/5140.
@inline function fast_inv_cuda(a::Float64)
    # Get the approximate reciprocal
    # https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-rcp-approx-ftz-f64
    # This instruction chops off the last 32 bits of the mantissa and computes the inverse
    # while treating all subnormal numbers as 0.0.
    # If the reciprocal would be subnormal, it underflows to 0.0.
    # The 32 least significant bits of the result are filled with 0s.
    inv_a = ccall("llvm.nvvm.rcp.approx.ftz.d", llvmcall, Float64, (Float64,), a)

    # Approximate the missing 32 bits of the mantissa with a single cubic iteration
    e = fma(inv_a, -a, 1.0)
    e = fma(e, e, e)
    inv_a = fma(e, inv_a, inv_a)

    return inv_a
end

end # module
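
To see why the single cubic iteration recovers the chopped mantissa bits, here is a CPU-only sketch (not part of the extension) that emulates the truncated reciprocal by zeroing the low 32 mantissa bits of an exact reciprocal:
```julia
# Emulate the effect of `rcp.approx.ftz.d` on the CPU: keep sign, exponent,
# and the top 20 mantissa bits of the exact reciprocal; zero the lowest 32 bits.
truncated_reciprocal(a::Float64) =
    reinterpret(Float64, reinterpret(UInt64, inv(a)) & 0xffffffff00000000)

function refined_reciprocal(a::Float64)
    inv_a = truncated_reciprocal(a)
    # Single cubic iteration, as in `fast_inv_cuda`:
    e = fma(inv_a, -a, 1.0)      # residual e = 1 - a * inv_a
    e = fma(e, e, e)             # e + e^2
    inv_a = fma(e, inv_a, inv_a) # inv_a * (1 + e + e^2) = 1/a up to O(e^3)
    return inv_a
end

# The truncated value is accurate to only about 20 mantissa bits; the cubic
# iteration cubes the relative error, bringing it back to (nearly) full
# `Float64` accuracy.
a = 3.7
err_before = abs(truncated_reciprocal(a) - inv(a))
err_after = abs(refined_reciprocal(a) - inv(a))
```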
5 changes: 4 additions & 1 deletion src/schemes/fluid/fluid.jl
@@ -150,7 +150,10 @@ end
particle_system, neighbor_system,
particle, neighbor, rho_a, rho_b)

dv[end, particle] += rho_a / rho_b * m_b * dot(vdiff, grad_kernel)
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
dv[end, particle] += div_fast(rho_a, rho_b) * m_b * dot(vdiff, grad_kernel)

# Artificial density diffusion should only be applied to systems representing a fluid
# with the same physical properties, i.e. density and viscosity.
20 changes: 16 additions & 4 deletions src/schemes/fluid/pressure_acceleration.jl
@@ -7,14 +7,20 @@
# asymmetric version.
@inline function pressure_acceleration_summation_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
W_a)
return -m_b * (p_a / rho_a^2 + p_b / rho_b^2) * W_a
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
return -m_b * (div_fast(p_a, rho_a^2) + div_fast(p_b, rho_b^2)) * W_a
end

# Same as above, but not assuming symmetry of the kernel gradient. To be used with
# corrections that do not produce a symmetric kernel gradient.
@inline function pressure_acceleration_summation_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
W_a, W_b)
return -m_b * (p_a / rho_a^2 * W_a - p_b / rho_b^2 * W_b)
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
return -m_b * (div_fast(p_a, rho_a^2) * W_a - div_fast(p_b, rho_b^2) * W_b)
end

# As shown in "Variational and momentum preservation aspects of Smooth Particle Hydrodynamic
@@ -26,14 +32,20 @@ end
# asymmetric version.
@inline function pressure_acceleration_continuity_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
W_a)
return -m_b * (p_a + p_b) / (rho_a * rho_b) * W_a
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
return -m_b * div_fast(p_a + p_b, rho_a * rho_b) * W_a
end

# Same as above, but not assuming symmetry of the kernel gradient. To be used with
# corrections that do not produce a symmetric kernel gradient.
@inline function pressure_acceleration_continuity_density(m_a, m_b, rho_a, rho_b, p_a, p_b,
W_a, W_b)
return -m_b / (rho_a * rho_b) * (p_a * W_a - p_b * W_b)
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
return -div_fast(m_b, rho_a * rho_b) * (p_a * W_a - p_b * W_b)
end

@doc raw"""
60 changes: 40 additions & 20 deletions src/schemes/fluid/viscosity.jl
@@ -127,8 +127,11 @@ end
# approaching particles and turn it off for receding particles. In this way, the
# viscosity is used for shocks and not rarefactions."
if vr < 0
mu = h * vr / (distance^2 + epsilon * h^2)
return (alpha * c * mu + beta * mu^2) / rho_mean * grad_kernel
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
mu = div_fast(h * vr, distance^2 + epsilon * h^2)
return div_fast(alpha * c * mu + beta * mu^2, rho_mean) * grad_kernel
end

return zero(v_diff)
@@ -142,8 +145,11 @@ end
mu_a = nu_a * rho_a
mu_b = nu_b * rho_b

return (mu_a + mu_b) / (rho_a * rho_b) * dot(pos_diff, grad_kernel) /
(distance^2 + epsilon * h^2) * v_diff
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
return div_fast((mu_a + mu_b) * dot(pos_diff, grad_kernel),
rho_a * rho_b * (distance^2 + epsilon * h^2)) * v_diff
end

# See, e.g.,
@@ -177,17 +183,20 @@ struct ViscosityAdami{ELTYPE}
end
end

function adami_viscosity_force(smoothing_length_average, pos_diff, distance, grad_kernel,
function adami_viscosity_force(h, pos_diff, distance, grad_kernel,
m_a, m_b, rho_a, rho_b, v_diff, nu_a, nu_b, epsilon)
eta_a = nu_a * rho_a
eta_b = nu_b * rho_b

eta_tilde = 2 * (eta_a * eta_b) / (eta_a + eta_b)
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
volume_a = div_fast(m_a, rho_a)
volume_b = div_fast(m_b, rho_b)

tmp = eta_tilde / (distance^2 + epsilon * smoothing_length_average^2)

volume_a = m_a / rho_a
volume_b = m_b / rho_b
# eta_tilde = 2 * (eta_a * eta_b) / (eta_a + eta_b)
# tmp = eta_tilde / (distance^2 + epsilon * h^2) / m_a
tmp = div_fast(2 * eta_a * eta_b, (eta_a + eta_b) * (distance^2 + epsilon * h^2) * m_a)

# This formulation was introduced by Hu and Adams (2006). https://doi.org/10.1016/j.jcp.2005.09.001
# They argued that the formulation is more flexible because of the possibility to formulate
@@ -198,9 +207,9 @@
# Because when using this formulation for the pressure acceleration, it is not
# energy conserving.
# See issue: https://github.com/trixi-framework/TrixiParticles.jl/issues/394
visc = (volume_a^2 + volume_b^2) * dot(grad_kernel, pos_diff) * tmp / m_a
visc = (volume_a^2 + volume_b^2) * dot(grad_kernel, pos_diff) * tmp

return visc .* v_diff
return visc * v_diff
end

@inline function (viscosity::ViscosityAdami)(particle_system, neighbor_system,
@@ -334,7 +343,10 @@ ViscosityAdamiSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityAdamiSGS(nu, C_S, eps
# and then the Smagorinsky eddy viscosity:
# ν_SGS = (C_S * h̄)^2 * S_mag.
#
S_mag = norm(v_diff) / (distance + epsilon)
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
S_mag = div_fast(sqrt(dot(v_diff, v_diff)), distance + epsilon)
nu_SGS = (viscosity.C_S * smoothing_length_average)^2 * S_mag

# Effective kinematic viscosity is the sum of the standard and SGS parts.
Expand Down Expand Up @@ -412,7 +424,7 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e

smoothing_length_particle = smoothing_length(particle_system, particle)
smoothing_length_neighbor = smoothing_length(particle_system, neighbor)
smoothing_length_average = (smoothing_length_particle + smoothing_length_neighbor) / 2
h = (smoothing_length_particle + smoothing_length_neighbor) / 2

nu_a = kinematic_viscosity(particle_system,
viscosity_model(neighbor_system, particle_system),
@@ -427,8 +439,11 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e

# SGS part: Compute the subgrid-scale eddy viscosity.
# See comments above for `ViscosityAdamiSGS`.
S_mag = norm(v_diff) / (distance + epsilon)
nu_SGS = (viscosity.C_S * smoothing_length_average)^2 * S_mag
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
S_mag = div_fast(sqrt(dot(v_diff, v_diff)), distance + epsilon)
nu_SGS = (viscosity.C_S * h)^2 * S_mag

# Effective viscosities include the SGS term.
nu_a_eff = nu_a + nu_SGS
@@ -438,9 +453,11 @@ ViscosityMorrisSGS(; nu, C_S=0.1, epsilon=0.001) = ViscosityMorrisSGS(nu, C_S, e
mu_a = nu_a_eff * rho_a
mu_b = nu_b_eff * rho_b

force_Morris = (mu_a + mu_b) / (rho_a * rho_b) * (dot(pos_diff, grad_kernel)) /
(distance^2 + epsilon * smoothing_length_average^2) * v_diff
return m_b * force_Morris
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
return div_fast(m_b * (mu_a + mu_b) * dot(pos_diff, grad_kernel),
rho_a * rho_b * (distance^2 + epsilon * h^2)) * v_diff
end

function kinematic_viscosity(system, viscosity::ViscosityMorrisSGS, smoothing_length,
@@ -496,7 +513,10 @@ end
v_b = viscous_velocity(v_neighbor_system, neighbor_system, neighbor)
v_diff = v_a - v_b

gamma_dot = norm(v_diff) / (distance + epsilon)
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
gamma_dot = div_fast(sqrt(dot(v_diff, v_diff)), distance + epsilon)

# Compute Carreau-Yasuda effective viscosity
(; nu0, nu_inf, lambda, a, n) = viscosity
48 changes: 28 additions & 20 deletions src/schemes/fluid/weakly_compressible_sph/density_diffusion.jl
@@ -48,7 +48,10 @@ end

@inline function density_diffusion_psi(::DensityDiffusionMolteniColagrossi, rho_a, rho_b,
pos_diff, distance, system, particle, neighbor)
return 2 * (rho_a - rho_b) * pos_diff / distance^2
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
return div_fast(2 * (rho_a - rho_b), distance^2) * pos_diff
end

@doc raw"""
@@ -77,9 +80,11 @@ end

@inline function density_diffusion_psi(::DensityDiffusionFerrari, rho_a, rho_b,
pos_diff, distance, system, particle, neighbor)
return ((rho_a - rho_b) /
(smoothing_length(system, particle) + smoothing_length(system, neighbor))) *
pos_diff / distance
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
h = smoothing_length(system, particle) + smoothing_length(system, neighbor)
return div_fast(rho_a - rho_b, h * distance) * pos_diff
end

@doc raw"""
Expand Down Expand Up @@ -154,21 +159,23 @@ function allocate_buffer(ic, dd::DensityDiffusionAntuono, buffer::SystemBuffer)
return initial_condition, DensityDiffusionAntuono(initial_condition; delta=dd.delta)
end

@inline function density_diffusion_psi(density_diffusion::DensityDiffusionAntuono,
rho_a, rho_b, pos_diff, distance, system,
particle, neighbor)
@propagate_inbounds function density_diffusion_psi(density_diffusion::DensityDiffusionAntuono,
rho_a, rho_b, pos_diff, distance, system,
particle, neighbor)
(; normalized_density_gradient) = density_diffusion

normalized_gradient_a = extract_svector(normalized_density_gradient, system, particle)
normalized_gradient_b = extract_svector(normalized_density_gradient, system, neighbor)

# Fist term by Molteni & Colagrossi
# First term by Molteni & Colagrossi
result = 2 * (rho_a - rho_b)

# Second correction term
normalized_gradient_a = extract_svector(normalized_density_gradient, system, particle)
normalized_gradient_b = extract_svector(normalized_density_gradient, system, neighbor)
result -= dot(normalized_gradient_a + normalized_gradient_b, pos_diff)

return result * pos_diff / distance^2
# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
return div_fast(result, distance^2) * pos_diff
end

function update!(density_diffusion::DensityDiffusionAntuono, v, u, system, semi)
@@ -217,19 +224,20 @@ end
# See `src/general/smoothing_kernels.jl` for more details.
distance^2 < eps(initial_smoothing_length(particle_system)^2) && return

(; delta) = density_diffusion
sound_speed = system_sound_speed(particle_system)

volume_b = m_b / rho_b

# Since this is one of the most performance critical functions, using fast divisions
# here gives a significant speedup on GPUs.
# See the docs page "Development" for more details on `div_fast`.
volume_b = div_fast(m_b, rho_b)
psi = density_diffusion_psi(density_diffusion, rho_a, rho_b, pos_diff, distance,
particle_system, particle, neighbor)
density_diffusion_term = dot(psi, grad_kernel) * volume_b
density_diffusion_term = volume_b * dot(psi, grad_kernel)

smoothing_length_avg = (smoothing_length(particle_system, particle) +
smoothing_length(particle_system, neighbor)) / 2
dv[end, particle] += delta * smoothing_length_avg * sound_speed *
density_diffusion_term

(; delta) = density_diffusion
sound_speed = system_sound_speed(particle_system)
dv[end, particle] += delta * smoothing_length_avg * sound_speed * density_diffusion_term
end

# Density diffusion `nothing` or interaction other than fluid-fluid
6 changes: 6 additions & 0 deletions src/util.jl
@@ -1,3 +1,9 @@
# By default, use `div_fast` of `Base.FastMath`.
# In the `TrixiParticlesCUDAExt` extension, this is redefined for `Float64`.
@inline function div_fast(x, y)
return Base.FastMath.div_fast(x, y)
end

# Same as `foreach`, but it is unrolled by the compiler for small input tuples
@inline function foreach_noalloc(func, collection)
element = first(collection)