---
title: "0041 - Testing Maximal Reconvergence"
params:
  authors:
    - luciechoi: Lucie Choi
  sponsors:
    - s-perron: Steven Perron
    - Keenuts: Nathan Gauër
    - bogner: Justin Bogner
  status: Under Consideration
---

* PRs: [Testing in offload-test-suite
  (Draft)](https://github.com/llvm/offload-test-suite/pull/685)
* Issues: [Implementation in
  Clang](https://github.com/llvm/llvm-project/issues/136930)

## Introduction

This proposal seeks to add comprehensive conformance tests verifying that HLSL
compilers (DXC and Clang) do not violate the optimization restrictions in
section [1.6.3 of the HLSL
specification](https://microsoft.github.io/hlsl-specs/specs/hlsl.pdf?#page=9).

## Motivation

Graphics compilers often perform aggressive optimizations that can unexpectedly
alter the set of active threads in a wave. This is a critical issue for shaders
containing operations that depend on which threads are active, such as wave
intrinsics, since invalid transformations can lead to wrong or indeterminate
results. Historically, there has been only an [informal
definition](https://github.com/microsoft/directxshadercompiler/wiki/wave-intrinsics#operation)
of which threads should be active at any point in the execution of a shader:
"*implementations must enforce that the number of active lanes exactly
corresponds to the programmer’s view of flow control*".

When lowering HLSL to SPIR-V, we must make sure the output matches this
expectation. To do so, two areas need to be examined:

#### 1. Adding the `SPV_KHR_maximal_reconvergence` extension and the `MaximallyReconvergesKHR` execution mode. These are Vulkan-specific.

These are an indicator for the driver compilers to respect the above requirement
downstream. The frontend compilers will emit these instructions when the
`-fspv-enable-maximal-reconvergence` flag is set.

#### 2. Ensuring the frontend compilers themselves do not alter the state during optimizations.

This is the area that needs extensive testing. In the example below, a compiler
may reorder the code (e.g. in the SimplifyCFG pass) so that the identical
trailing statements are sunk out of the branches to the merge point, producing
incorrect results.

##### Before Optimization
```cpp
if (non_uniform_cond) {
  doA();
  Out[...] = waveOperations();
} else {
  doB();
  Out[...] = waveOperations();
}
```

##### After Optimization
```cpp
if (non_uniform_cond) {
  doA();
} else {
  doB();
}
// Invalid transformation.
Out[...] = waveOperations();
```

This kind of optimization must be prevented. In DXC, spirv-opt is used to
optimize when targeting Vulkan. It is aware of HLSL's
Single-Program-Multiple-Data (SPMD) programming model, since SPIR-V has a
similar programming model.

In Clang, we leverage [convergence control
tokens](https://llvm.org/docs/ConvergentOperations.html#overview) within the IR
to explicitly mark the convergent operations (e.g. wave intrinsics) and the
convergence points of the threads executing those instructions, so that
optimization passes are aware of them and avoid invalid transformations.

Testing for correct convergence behavior is critical for reliability. Currently,
only a few unit tests exist. We need to extend this coverage to include complex
and highly divergent cases.

## Proposed solution

We propose implementing a comprehensive test suite in the offload-test-suite
repository that mirrors the logic of the Vulkan Conformance Test Suite
(Vulkan-CTS). This involves generating shaders with random control flow (mixes
of if/switch statements, loops, and nesting) and verifying the results.

### Shader Generation

A large number of shaders with random control flow will be generated. These
shaders use fixed input buffers and write results to output buffers to verify
which threads are active at each point in the shader.

### Expected Results

The expected results will be calculated by simulating the execution of the
shader on the CPU using characteristics of the machine, like wave size. This
will ensure that we can get the expected results on any platform.

### Verification

We will generate a set of YAML test files for the offload-test-suite. For each
shader and wave size (4, 8, 16, 32), a test file will be generated that
executes the shader and verifies that the results match the expected results.

## Detailed design

### Test Generation

Logic from the [Vulkan
CTS](https://github.com/KhronosGroup/VK-GL-CTS/blob/main/external/vulkancts/modules/vulkan/reconvergence/vktReconvergenceTests.cpp)
will be ported to produce HLSL.

At a high level, each test generation goes through the following steps:

1. Generate instructions with a random control flow.
2. Calculate the expected results (i.e. CPU simulation).
3. Produce the HLSL shader.
4. Format the shader and expected results for offload-test-suite.

This is an [example](https://github.com/llvm/offload-test-suite/pull/685) of the
test generator and the generated
[tests](https://github.com/llvm/offload-test-suite/pull/620).

#### 1. (Pseudo) Random shaders

Random control flow will be produced by a fixed-seed RNG and hard-coded
probabilities. For example, they determine whether the next instruction will be
a loop, if, switch, etc., and with what conditions. For the pseudo-random
number generator, we will port one from
[llvm::RandomNumberGenerator](https://github.com/llvm/llvm-project/blob/8e335d533682b46289058958456c521df0c8fe32/llvm/include/llvm/Support/RandomNumberGenerator.h#L33C1-L38C42),
which is deterministic and operating-system independent.

These random instructions are represented in a custom intermediate
representation to simplify calculating the expected results during the CPU
simulation and later producing HLSL shaders with correct syntax. Each shader
program is represented as a stack of these IR instructions, e.g. `OP_IF`,
`OP_BALLOT`, `OP_DO_WHILE`, etc.

#### 2. Expected results

During the CPU simulation, these instructions are popped from the stack, and for
each instruction, active thread masks are calculated and stored in a separate
stack. These masks are used to calculate the expected results of operations
whenever a write happens.

There are two types of write operations: storing 1) the indices of active
threads, and 2) a constant value. These values will be kept in a separate
vector, and this is the output buffer we will use for the test verification.
They will help determine whether an invalid compiler transformation happened.

Because the program has random control flow with a random number of writes, the
size of the output buffer is unknown at the start. Therefore, it will be
calculated in a separate "dry-run" pass before running the CPU simulation, which
simply walks through the instructions and counts the number of writes.

#### 3. HLSL translation

Once the expected results are calculated, the intermediate representation is
lowered to HLSL. As in the CPU simulation, each instruction is popped from the
stack and translated to HLSL (e.g. `OP_ELECT` --> `WaveIsFirstLane()`,
`OP_BALLOT` --> `WaveActiveBallot()`, etc.). This is the part that differs from
Vulkan-CTS, which produces GLSL shaders.

#### 4. Final test file

At this point, the expected results and shaders are ready to be formatted for
offload-test-suite.

One key thing to note is that different GPUs have different wave sizes, and
different wave sizes need different expected results. It's not easy to know the
wave size at the test generation step, since it would require setting up a
graphics pipeline to query the value.

Therefore, we will prepare the tests for all possible wave sizes (every power
of 2 between 4 and 32, i.e. 4, 8, 16, 32) and have the test pipeline skip those
that do not match the wave size at test runtime. We will implement a
`WaveSizeX` directive and append this condition to the test files. As an
example, a test file will contain `# UNSUPPORTED: !WaveSize32` and will not run
on platforms where the wave size is not 32.

### Workflow Trigger

Only the code for the **random test generator** will reside in the
offload-test-suite repository. The shaders will be generated as part of the
pipeline.

#### CMake Target

We will implement CMake targets `check-hlsl-{platform}-reconvergence`, similar
to the existing targets. Running this command will generate the physical tests
and execute them. A separate CMake target will write the tests so that they are
not regenerated every time the tests are run.

#### GitHub Workflow

New steps will be added at the end of the existing workflow:

- Build DXC
- Build LLVM
- Dump GPU Info
- Run HLSL Tests
- **Run Reconvergence Tests**

This way, the execution of the existing HLSL tests and the reconvergence tests
is separated, and it will be easier to report and investigate issues.

We don't plan to store the physical test files in the repository. Developers
can still save, run, and inspect the tests locally by running the target on
their machines.

### Reporting

Since the output buffer is large, logs can be large if the results don't match.
We will segment the output buffer and its verification into multiple buffers
and checks, or implement an environment variable to filter out some logs.

If any test fails, it will fail the workflow, so it's noticeable in the badge.

`XFAIL` annotations will be added as appropriate to suppress known failures.
Since it is undesirable to change the code of the C++ random test generator
every time a failure happens, the test generator may read a structured text
file that contains a list of failing tests and their environments. This way,
only this single file needs to be updated upon any changes in the compilers,
and the algorithm for generating the tests remains intact.

*reconvergence-failing-tests.txt*
```
reconvergence-test_2_16_7_13_3.test
# Some comment
# XFAIL: Clang && Vulkan

# Some comment
# XFAIL: ...

reconvergence-test_5_32_7_13_1.test
# Some comment
# XFAIL: ...

```

### Latency

The entire Vulkan-CTS test suite (~1500 shaders) takes ~10 seconds to complete,
so the test generation and execution time should be similar and should not
significantly affect the current pipeline duration. We may also choose to start
with a smaller number of shaders (~100).

### Debugging

Debugging a failed test will be hard, as a randomly generated shader makes it
difficult for readers to calculate the expected result at a given line. There
are several ways to help pinpoint a bug:

- Reducing the workgroup size and/or nesting level.
- Comparing the results with other GPUs and/or backends.
- Writing a reducer for the randomly generated shaders.

It is worth noting that failures may originate from driver compilers rather
than the frontend compilers.

### Sanity Check

A small subset of pre-generated tests may be included in the repository as a
sanity check.

## Alternatives considered (Optional)

The proposed solution is a hybrid of the two alternatives considered.

### Option 1: Pre-generate and store all shaders in YAML

This approach involves generating all shaders offline and storing them in the
repository. Although this is straightforward to implement, it's not necessary
to maintain physical copies of the random shaders, and we may later want to
change the parameters of the generator (e.g. seed, nesting level).

### Option 2: Generation and execution in a separate test pipeline

This approach mimics Vulkan-CTS by doing the shader generation, CPU simulation,
and GPU execution in its own custom test pipeline, without storing any physical
copies at any point in time. However, this requires implementing the entire
pipeline from scratch for multiple backends, including DirectX and Metal.

## Acknowledgments

Steven Perron and Nathan Gauër for reviewing the initial planning and
documentation.