Commit 732cfef

[0041] New proposal for testing-maximal-reconvergence (#376)

This proposal suggests a different approach for having comprehensive test coverage for the maximal reconvergence feature in the Clang compiler.

Co-authored-by: Steven Perron
Co-authored-by: Justin Bogner

1 file changed: 295 additions & 0 deletions
---
title: "0041 - Testing Maximal Reconvergence"
params:
  authors:
    - luciechoi: Lucie Choi
  sponsors:
    - s-perron: Steven Perron
    - Keenuts: Nathan Gauër
    - bogner: Justin Bogner
  status: Under Consideration
---

* PRs: [Testing in offload-test-suite (Draft)](https://github.com/llvm/offload-test-suite/pull/685)
* Issues: [Implementation in Clang](https://github.com/llvm/llvm-project/issues/136930)

## Introduction

This proposal seeks to add comprehensive conformance tests verifying that HLSL compilers (DXC and Clang) do not violate the optimization restrictions in section [1.6.3 of the HLSL specification](https://microsoft.github.io/hlsl-specs/specs/hlsl.pdf?#page=9).

## Motivation

Graphics compilers often perform aggressive optimizations that can unexpectedly alter the active state of threads in a wave. This is a critical issue for shaders containing operations that depend on which threads are active, such as wave intrinsics, since invalid transformations can lead to wrong or indeterminate results. Historically, there has been only an [informal definition](https://github.com/microsoft/directxshadercompiler/wiki/wave-intrinsics#operation) of which threads should be active at any point in the execution of a shader: "<i>implementations must enforce that the number of active lanes exactly corresponds to the programmer’s view of flow control</i>".

When lowering HLSL to SPIR-V, we must make sure the output matches this expectation. To do so, two areas need attention:

#### 1. Adding the `SPV_KHR_maximal_reconvergence` extension and the `MaximallyReconvergesKHR` capability. These are Vulkan-specific.

These are an indicator that the driver compilers downstream must respect the above requirement. The frontend compilers will append these instructions when the `-fspv-enable-maximal-reconvergence` flag is set.

#### 2. Ensuring the frontend compilers themselves do not alter the state during optimizations.

This is the area that needs extensive testing. In the example below, a compiler may reorder the code (e.g. in the SimplifyCFG pass) so that statements inside the branches are hoisted out, producing incorrect results.

##### Before Optimization
```cpp
if (non_uniform_cond) {
  doA();
  Out[...] = waveOperations();
} else {
  doB();
  Out[...] = waveOperations();
}
```

##### After Optimization
```cpp
if (non_uniform_cond) {
  doA();
} else {
  doB();
}
// Invalid transformation.
Out[...] = waveOperations();
```

This kind of optimization should be prevented. In DXC, spirv-opt is used to optimize when targeting Vulkan. It is aware of HLSL's Single-Program-Multiple-Data (SPMD) programming model, since SPIR-V has a similar programming model.

In Clang, we leverage [convergence control tokens](https://llvm.org/docs/ConvergentOperations.html#overview) within the IR to explicitly mark the convergent operations (i.e. wave intrinsics) and the convergence points of the threads executing those instructions, so that optimization passes are aware of them and avoid invalid transformations.
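
A rough sketch of how such tokens look in LLVM IR, simplified from the ConvergentOperations documentation (`@waveOp` here is a hypothetical placeholder for a convergent wave intrinsic, not a real intrinsic name):

```llvm
; Hypothetical convergent wave operation.
declare i32 @waveOp() convergent

define void @example(i1 %non_uniform_cond) convergent {
entry:
  ; Token identifying the convergence region of the function entry.
  %tok = call token @llvm.experimental.convergence.entry()
  br i1 %non_uniform_cond, label %then, label %else

then:
  ; The "convergencectrl" bundle ties this call to %tok, so passes
  ; may not hoist or sink it across divergent control flow.
  %a = call i32 @waveOp() [ "convergencectrl"(token %tok) ]
  br label %merge

else:
  %b = call i32 @waveOp() [ "convergencectrl"(token %tok) ]
  br label %merge

merge:
  ret void
}
```

Because the two `@waveOp` calls carry tokens and sit in different branches, a pass like SimplifyCFG cannot merge them into a single call at `%merge` without changing which threads communicate.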

Testing for correct convergence behavior is critical for reliability. Currently, only a few unit tests exist. We need to extend this coverage to include complex and highly divergent cases.

## Proposed solution

We propose implementing a comprehensive test suite in the offload-test-suite repository that mirrors the logic of the Vulkan Conformance Test Suite (Vulkan-CTS). This involves generating shaders with random control flow (mixes of if/switch statements, loops, and nesting) and verifying the results.

### Shader Generation

A large number of shaders with random control flow will be generated. These shaders use fixed input buffers and write results to output buffers to verify which threads are active at each point in the shader.

### Expected Results

The expected results will be calculated by simulating the execution of the shader on the CPU, using characteristics of the machine such as the wave size. This ensures that we can obtain the expected results on any platform.

### Verification

We will generate a set of YAML test files for the offload-test-suite. For each shader and wave size (4, 8, 16, 32), a test file will be generated that executes the shader and verifies that the results match the expected results.

## Detailed design

### Test Generation

Logic from the [Vulkan CTS](https://github.com/KhronosGroup/VK-GL-CTS/blob/main/external/vulkancts/modules/vulkan/reconvergence/vktReconvergenceTests.cpp) will be ported to produce HLSL.

At a high level, each test generation goes through the following steps:

1. Generate instructions with random control flow.
2. Calculate the expected results (i.e. run the CPU simulation).
3. Produce the HLSL shader.
4. Format the shader and expected results for offload-test-suite.

This is an [example](https://github.com/llvm/offload-test-suite/pull/685) of the test generator, and these are the generated [tests](https://github.com/llvm/offload-test-suite/pull/620).

#### 1. (Pseudo) Random shaders

Random control flow will be produced by a fixed-seed RNG and hard-coded probabilities. For example, these determine whether the next instruction will be a loop, an if, a switch, etc., and with what conditions. For the pseudo-random number generator, we will port one from [llvm::RandomNumberGenerator](https://github.com/llvm/llvm-project/blob/8e335d533682b46289058958456c521df0c8fe32/llvm/include/llvm/Support/RandomNumberGenerator.h#L33C1-L38C42), which is deterministic and operating-system independent.

These random instructions are represented in a custom intermediate representation, to simplify calculating the expected results during the CPU simulation and later producing HLSL shaders with correct syntax. Each shader program is represented as a stack of these IR instructions, e.g. `OP_IF`, `OP_BALLOT`, `OP_DO_WHILE`, etc.
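
A minimal sketch of this generation step, under stated assumptions: the opcode names and probability buckets below are invented for illustration, and `std::mt19937_64` stands in for the ported LLVM RNG (both are deterministic given a fixed seed):

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical opcodes for the custom IR; real names come from the generator.
enum class Op { If, DoWhile, Switch, Ballot, Store };

// Deterministically produce `count` instructions from hard-coded
// probabilities. std::mt19937_64 with a fixed seed yields the same stream
// on every OS. We avoid std::uniform_int_distribution because its output
// is implementation-defined; a plain modulo keeps the sequence identical
// everywhere (the slight bias is irrelevant for test generation).
std::vector<Op> generateOps(uint64_t seed, int count) {
  std::mt19937_64 rng(seed);
  std::vector<Op> program;
  for (int i = 0; i < count; ++i) {
    uint64_t roll = rng() % 100; // hard-coded probability buckets
    if (roll < 20)      program.push_back(Op::If);      // 20% branch
    else if (roll < 35) program.push_back(Op::DoWhile); // 15% loop
    else if (roll < 45) program.push_back(Op::Switch);  // 10% switch
    else if (roll < 75) program.push_back(Op::Ballot);  // 30% wave op
    else                program.push_back(Op::Store);   // 25% write
  }
  return program;
}
```

Running the generator twice with the same seed yields byte-identical programs, which is what makes the pre-computed expected results reusable across runs.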

#### 2. Expected results

During the CPU simulation, these instructions are popped from the stack, and for each instruction the active thread masks are calculated and stored in a separate stack. These masks are used to calculate the expected results of operations whenever a write happens.

There are two types of write operations, storing 1) the indices of the active threads, and 2) a constant value. These values will be kept in a separate vector, and this is the output buffer we will use for the test verification. They will help determine whether an invalid compiler transformation happened.

Because the program has random control flow with a random number of writes, the size of the output buffer is unknown at the start. Therefore, it will be calculated in a separate "dry-run" pass before running the CPU simulation. This pass simply walks through the instructions and counts the number of writes.
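
The dry-run pass can be sketched as follows (a simplification with invented opcode names; which opcodes actually emit output-buffer writes is an assumption here):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical opcodes mirroring the custom IR described above.
enum class Op { If, DoWhile, Switch, Ballot, Store };

// Dry-run pass: size the output buffer before the real simulation by
// walking the instruction list and counting writes. The assumption that
// both Ballot and Store produce one output-buffer write each is
// illustrative only.
size_t countWrites(const std::vector<Op> &program) {
  size_t writes = 0;
  for (Op op : program)
    if (op == Op::Store || op == Op::Ballot)
      ++writes;
  return writes;
}
```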

#### 3. HLSL translation

Once the expected results are calculated, the intermediate representation is lowered to HLSL. As in the CPU simulation, each instruction is popped from the stack and translated to HLSL (e.g. `OP_ELECT` --> `WaveIsFirstLane()`, `OP_BALLOT` --> `WaveActiveBallot()`, etc.). This is the part that differs from Vulkan-CTS, which produces GLSL shaders.
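
The lowering step could look roughly like the sketch below. The HLSL intrinsic names (`WaveIsFirstLane`, `WaveActiveBallot`, `WaveGetLaneIndex`) are real, but the opcode set and the surrounding expressions are invented placeholders:

```cpp
#include <string>

// Hypothetical opcodes mirroring the custom IR described above.
enum class Op { Elect, Ballot, If, DoWhile };

// Sketch of the IR -> HLSL lowering: each popped instruction maps to an
// HLSL snippet. This is the only stage that differs from Vulkan-CTS,
// which emits GLSL at the equivalent point.
std::string toHLSL(Op op) {
  switch (op) {
  case Op::Elect:   return "WaveIsFirstLane()";
  case Op::Ballot:  return "WaveActiveBallot(true)";
  case Op::If:      return "if (inputBuf[WaveGetLaneIndex()] != 0) {";
  case Op::DoWhile: return "do {";
  }
  return "";
}
```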

#### 4. Final test file

At this point, the expected results and shaders are ready to be formatted for offload-test-suite.

One key thing to note is that different GPUs have different wave sizes, and different wave sizes need different expected results. It is not easy to know the wave size at the test generation step, since that would require setting up a graphics pipeline to query the value.

Therefore, we will prepare the tests for all possible wave sizes (every power of 2 between 4 and 32, i.e. 4, 8, 16, 32) and have the test pipeline skip those that do not match the wave size at test runtime. We will implement a `WaveSizeX` directive and append this condition to the test files. As an example, a test file will contain `# UNSUPPORTED: !WaveSize32` and will not run on platforms where the wave size is not 32.

### Workflow Trigger

Only the code for the **random test generator** will reside in the offload-test-suite repository. The shaders will be generated as part of the pipeline.

#### CMake Target

We will implement CMake targets `check-hlsl-{platform}-reconvergence`, similar to the existing targets. Running this command will generate the physical tests and execute them. We will provide separate CMake targets for writing the tests so that they are not regenerated every time the tests are run.

#### GitHub Workflow

New steps will be added at the end of the existing workflow:

- Build DXC
- Build LLVM
- Dump GPU Info
- Run HLSL Tests
- **Run Reconvergence Tests**

This way, the execution of the existing HLSL tests and the reconvergence tests is separated, and it will be easier to report and investigate issues.

We don't plan to store the physical test files in the repo. Developers can still save, run, and inspect the tests locally by running the target on their machines.

### Reporting

Since the output buffer is large, logs can be large when the results don't match. We will segment the output buffer and its verification into multiple buffers and checks, or implement an environment variable to filter out some logs.

If any test fails, it will fail the workflow, so failures are noticeable in the badge.

`XFail` annotations will be added as appropriate to suppress known failures. Since it is undesirable to change the code of the C++ random test generator every time a failure happens, the test generator may read a structured text file that contains a list of failing tests and their environments. This way, only this single file is updated upon any change in the compilers, and the algorithm for generating the tests remains intact.

*reconvergence-failing-tests.txt*
```yaml
reconvergence-test_2_16_7_13_3.test
# Some comment
# XFAIL: Clang && Vulkan

# Some comment
# XFAIL: ...

reconvergence-test_5_32_7_13_1.test
# Some comment
# XFAIL: ...

```
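
A sketch of how the generator might consume this file (the format and its semantics are assumptions of this proposal, not an existing tool; each non-comment line names a test, and following `# XFAIL:` lines attach conditions to it):

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parse the failing-tests file into test-name -> XFAIL conditions.
std::map<std::string, std::vector<std::string>>
parseFailingTests(const std::string &text) {
  std::map<std::string, std::vector<std::string>> xfails;
  std::string line, current;
  std::istringstream in(text);
  while (std::getline(in, line)) {
    if (line.empty()) continue;
    if (line.rfind("# XFAIL:", 0) == 0) {
      if (!current.empty())
        xfails[current].push_back(line.substr(9)); // text after "# XFAIL: "
    } else if (line[0] != '#') {
      current = line; // a new test file name starts a group
      xfails[current]; // register the test even if no XFAIL follows
    }
    // other "# ..." lines are free-form comments and are ignored
  }
  return xfails;
}
```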

### Latency

The entire Vulkan-CTS reconvergence test group (~1500 shaders) takes ~10 seconds to complete, so the test generation plus execution time should be similar and should not significantly affect the current pipeline duration. We may also choose to start with a smaller number of shaders (~100).

### Debugging

Debugging a failed test will be hard, as a randomly generated shader is not intuitive for a reader trying to calculate the expected result at a given line. There are several ways to help pinpoint a bug:

- Reducing the workgroup size and/or nesting level.
- Comparing the results with other GPUs and/or backends.
- Writing a reducer for the randomly generated shaders.

It is worth noting that failures may originate from the driver compilers rather than the frontend compilers.

### Sanity Check

A small subset of pre-generated tests may be included in the repository as a sanity check.

## Alternatives considered (Optional)

The proposed solution is a hybrid of the two alternatives considered.

### Option 1: Pre-generate and store all shaders in YAML

This approach involves generating all shaders offline and storing them in the repository. Although this is straightforward to implement, it is not necessary to maintain physical copies of the random shaders, and we may later want to change the parameters of the generator (e.g. seed, nesting level).

### Option 2: Generation and execution in a separate test pipeline

This approach mimics Vulkan-CTS by doing the shader generation, CPU simulation, and GPU execution in its own custom test pipeline, without storing any physical copies at any point in time. However, this requires implementing the entire pipeline from scratch for multiple backends, including DirectX and Metal.

## Acknowledgments

Thanks to Steven Perron and Nathan Gauër for reviewing the initial planning and documentation.
