Releases · proger/accelerated-scan

07 Jan 07:35

proger

v0.3.1

866fa4c

0.3.1 — accelerated_scan.complex backward edition Latest


This release includes correct gradients for the complex scan contributed by @ekellbuch in #15.

Contributors

ekellbuch


25 Dec 04:21

proger

v0.3.0

7d550fe

0.3.0 — accelerated_scan.complex

accelerated_scan.complex now supports long variable-length complex-valued inputs. accelerated_scan.triton has been renamed to accelerated_scan.scalar and now supports variable-length inputs by looping over short chunks (2048 items) of scans, similar to the warp implementation. The Triton version has been tested on GB10 (DGX Spark).
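The chunk loop described above can be sketched in plain Python. This is only an illustration of the idea, not the library's Triton kernel: the function names scan_chunk and scan_long are hypothetical, the zero initial state is an assumption, and the only detail taken from the release note is the chunk length of 2048. The last state of each chunk is carried in as the initial state of the next one, so the result matches a single long scan.

```python
import cmath


def scan_chunk(gates, tokens, initial):
    """Scan one short chunk, starting from the state carried in from the previous chunk."""
    x, out = initial, []
    for g, u in zip(gates, tokens):
        x = g * x + u
        out.append(x)
    return out


def scan_long(gates, tokens, chunk=2048):
    """Scan a sequence of any length by looping over fixed-size chunks."""
    out, carry = [], 0j  # assumed initial state: zero
    for start in range(0, len(gates), chunk):
        stop = start + chunk
        out.extend(scan_chunk(gates[start:stop], tokens[start:stop], carry))
        carry = out[-1]  # last state of this chunk seeds the next chunk
    return out


# Complex-valued example: a decaying rotation driven by a constant input.
n = 5000
gates = [0.9 * cmath.exp(0.1j)] * n
tokens = [1 + 0j] * n
states = scan_long(gates, tokens)
```

Because the carry is threaded through every chunk boundary, the sequence length does not need to be a multiple of the chunk size.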

Full Changelog: v0.2.0...v0.3.0


20 May 10:34

proger

v0.2.0

db7145f

0.2.0 — faster training!

@unixpickle has fused the sequence reversal required by the backward pass into the kernel and vectorized the loads and stores of entries. Training is 30-40 percent faster on a 3090.
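To see why the backward pass needs a sequence reversal at all, here is a minimal real-valued sketch of the gradient of the recurrence x[t] = gate[t] * x[t-1] + token[t]. It is not the fused kernel: the function names forward and backward are hypothetical, and a zero initial state is assumed. The gradient with respect to the tokens is itself a scan, but it runs from the last time step to the first and uses the gates shifted by one position.

```python
def forward(gates, tokens):
    """x[t] = gates[t] * x[t-1] + tokens[t], with an assumed initial state of zero."""
    x, states = 0.0, []
    for g, u in zip(gates, tokens):
        x = g * x + u
        states.append(x)
    return states


def backward(gates, states, grad_out):
    """Gradients of a loss with respect to gates and tokens, given dL/dx[t] in grad_out."""
    n = len(gates)
    grad_tokens = [0.0] * n
    carry = 0.0
    # Reversed scan: d[t] = grad_out[t] + gates[t+1] * d[t+1]
    for t in reversed(range(n)):
        next_gate = gates[t + 1] if t + 1 < n else 0.0
        carry = grad_out[t] + next_gate * carry
        grad_tokens[t] = carry
    # dL/dgates[t] = d[t] * x[t-1]; x[-1] is the zero initial state
    grad_gates = [grad_tokens[t] * (states[t - 1] if t > 0 else 0.0) for t in range(n)]
    return grad_gates, grad_tokens
```

A fused kernel can read the inputs in reverse order directly instead of materializing flipped copies of the tensors, which is the kind of saving the release note describes.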

Contributors

unixpickle


31 Jan 17:18

proger

0.1.2

b7e4770

0.1.2 — reverse reference scan

This release adds a reverse=True flag to accelerated_scan.ref.scan.

Full Changelog: 0.1.1...0.1.2


11 Jan 15:24

proger

0.1.1

ad0dbfd

0.1.1 — 16 bit support

This release adds support for float16 and bfloat16 by templating the warp kernel. Below is the plot of maximum absolute errors comparing the reference implementation and the kernel:

[Plot: maximum absolute error, reference implementation vs. kernel]
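As a rough illustration of what a maximum-absolute-error comparison measures, the sketch below runs the same recurrence twice in plain Python: once in double precision and once with the inputs and every stored state rounded to float16. It does not reproduce the plotted data and does not call the library. The helper names and the random input ranges are assumptions, and bfloat16 is not covered because the standard library has no encoder for it.

```python
import random
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 (float16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def scan(gates, tokens, rnd=lambda v: v):
    """x[t] = gate[t] * x[t-1] + token[t]; inputs and each stored state pass through rnd."""
    x, out = 0.0, []
    for g, u in zip(gates, tokens):
        x = rnd(rnd(g) * x + rnd(u))  # accumulate in double, store rounded
        out.append(x)
    return out


def max_abs_error(gates, tokens):
    """Largest elementwise gap between the double and the float16 run."""
    ref = scan(gates, tokens)
    low = scan(gates, tokens, rnd=to_half)
    return max(abs(a - b) for a, b in zip(ref, low))


rng = random.Random(0)
gates = [rng.uniform(0.0, 0.9) for _ in range(1024)]    # assumed range, keeps states bounded
tokens = [rng.uniform(-1.0, 1.0) for _ in range(1024)]  # assumed range
err = max_abs_error(gates, tokens)
```

Keeping the gates below 1 in magnitude keeps the states bounded, so the error stays small instead of growing with sequence length.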


10 Jan 10:50

proger

0.1

37e4c0b

0.1

This package implements the fastest first-order parallel associative scan on the GPU for both the forward and backward passes.

The scan efficiently solves first-order recurrences of the form x[t] = gate[t] * x[t-1] + token[t], common in state space models and linear RNNs.
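The recurrence can be written as a plain loop, and it can also be computed by a parallel scan because composing two steps of the form x -> g * x + u yields another step of the same form. The sketch below shows both in pure Python. The function names are hypothetical, a zero initial state is assumed, and the parallel version uses a simple Hillis-Steele schedule chosen for brevity rather than the algorithm the kernels use.

```python
def scan_sequential(gates, tokens):
    """Reference loop: x[t] = gates[t] * x[t-1] + tokens[t], with x[-1] = 0 (assumed)."""
    x, out = 0.0, []
    for g, u in zip(gates, tokens):
        x = g * x + u
        out.append(x)
    return out


def combine(left, right):
    """Compose two affine steps x -> g * x + u; `left` is applied first."""
    g1, u1 = left
    g2, u2 = right
    return (g1 * g2, g2 * u1 + u2)


def scan_associative(gates, tokens):
    """Inclusive scan over (gate, token) pairs in about log2(n) rounds."""
    items = list(zip(gates, tokens))
    n = len(items)
    step = 1
    while step < n:
        # Every position in a round is independent, so a GPU can do them in parallel.
        items = [
            items[i] if i < step else combine(items[i - step], items[i])
            for i in range(n)
        ]
        step *= 2
    # With a zero initial state, the token component of each prefix is x[t].
    return [u for _, u in items]


gates = [0.5, 0.5, 0.5, 0.5]
tokens = [1.0, 1.0, 1.0, 1.0]
states = scan_sequential(gates, tokens)  # [1.0, 1.5, 1.75, 1.875]
```

The associativity of combine is what allows the work to be regrouped freely, whether into rounds as here or into chunks.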

The accelerated_scan.warp C++ CUDA kernel uses a chunked processing algorithm that leverages the fastest GPU communication primitives available at each level of the hierarchy: warp shuffles within warps of 32 threads, and shared memory (SRAM) between warps within a thread block. Each sequence of a channel dimension is confined to one thread block.
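The two-level structure can be illustrated in plain Python, with lists standing in for warps and thread blocks. This is a sketch of the general chunked-scan idea under a zero initial state, not a translation of the CUDA kernel. The function names are hypothetical, and the default chunk length of 32 is chosen only to mirror the warp size mentioned above.

```python
def sequential_scan(gates, tokens):
    """Plain loop used as a baseline: x[t] = gates[t] * x[t-1] + tokens[t]."""
    x, out = 0.0, []
    for g, u in zip(gates, tokens):
        x = g * x + u
        out.append(x)
    return out


def chunked_scan(gates, tokens, chunk=32):
    """Two-level scan: local scans per chunk, then a fix-up with the carried state."""
    # Level 1: each chunk is scanned independently from a zero state. On a GPU
    # these local scans could run concurrently, one per warp.
    local = []
    for start in range(0, len(gates), chunk):
        g_acc, x, rows = 1.0, 0.0, []
        for g, u in zip(gates[start:start + chunk], tokens[start:start + chunk]):
            g_acc, x = g_acc * g, g * x + u
            rows.append((g_acc, x))  # (product of gates so far, local state)
        local.append(rows)

    # Level 2: pass the final state of each chunk to the next one and correct
    # the local states: x[t] = local_x[t] + (gate product up to t) * carry.
    out, carry = [], 0.0
    for rows in local:
        out.extend(x + g_acc * carry for g_acc, x in rows)
        carry = out[-1]
    return out
```

Only one value per chunk crosses a chunk boundary, which is why the slower communication level is needed so rarely.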

The derivation of Chunked Scan has been used to extend the tree-level Blelloch algorithm to the block level.

A similar implementation is available in accelerated_scan.triton using Triton's tl.associative_scan primitive. It requires Triton 2.2 for its enable_fp_fusion flag.

