15 repositories

ScaleLLM
Public
A high-performance inference system for large language models, designed for production environments.
performance gpu model production cuda efficiency inference transformer llama speculative
C++
•
Apache License 2.0
•40•497•48•8•Updated Dec 19, 2025
cutlass
Public
CUDA Templates for Linear Algebra Subroutines
C++
•
Other
•1.8k•0•0•0•Updated Nov 6, 2025
nixl
Public
NVIDIA Inference Xfer Library (NIXL)
C++
•
Apache License 2.0
•292•0•0•0•Updated Nov 4, 2025
dynamo
Public
A Datacenter Scale Distributed Inference Serving Framework
Rust
•
Apache License 2.0
•1k•0•0•0•Updated Nov 4, 2025
whl
Public
Repository for hosting Python wheel (whl) packages.
HTML
•0•0•0•0•Updated Sep 13, 2025
flux
Public
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
C++
•
Apache License 2.0
•100•0•0•0•Updated Apr 15, 2025
3FS
Public
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
C++
•
MIT License
•1k•0•0•0•Updated Feb 28, 2025
flashinfer
Public
FlashInfer: Kernel Library for LLM Serving
Cuda
•
Apache License 2.0
•903•0•0•0•Updated Feb 27, 2025
FlashMLA
Public
C++
•
MIT License
•1k•0•0•0•Updated Feb 26, 2025
discussions
Public
0•0•0•0•Updated Jun 5, 2024
flash-attention
Public
Fast and memory-efficient exact attention
Python
•
BSD 3-Clause "New" or "Revised" License
•2.6k•0•0•0•Updated Oct 15, 2023
tokenizers
Public
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Rust
•
Apache License 2.0
•1.1k•0•0•0•Updated Aug 4, 2023
xformers
Public
Hackable and optimized Transformers building blocks, supporting a composable construction.
Python
•
Other
•772•0•0•0•Updated Aug 1, 2023
FasterTransformer
Public
Transformer-related optimization, including BERT and GPT
C++
•
Apache License 2.0
•935•0•0•0•Updated Jul 28, 2023
ByteTransformer
Public
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
C++
•
Apache License 2.0
•37•0•0•0•Updated Jul 24, 2023


Vectorch AI