LLM Grandmaster Notes

📚 The path to LLM mastery is paved with broken embeddings and resurrected gradients.

base
- transformer
- vision transformer (ViT)
- lm head
- kv cache
- GPU Architecture
  - SM80
  - SM90
  - SM100
  - SM120
attention
- self attention
- online attention
- flash attention
- flash attention 2
- flash attention 3
- flash decoding
- flash decoding++
- scaled dot-product attention (SDPA)
- multi-head self-attention (MHSA)
- multi-head attention (MHA)
- grouped-query attention (GQA)
- multi-query attention (MQA)
- multi-head latent attention (MLA)
- multi-token attention (MTA)
- sage attention 1
- sage attention 2
- sage attention 2++
- sage attention 3
- paged attention
- ring attention
- ring flash attention
- linear attention
- lightning attention
- native sparse attention (NSA)
- grouped latent attention (GLA)
- grouped-tied attention (GTA)
softmax
- softmax
- safe softmax
- online softmax
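
As a rough illustration of how the last two softmax variants relate, here is a minimal pure-Python sketch. It is not taken from this repository: the function names `safe_softmax` and `online_softmax` are hypothetical, and a real kernel would fuse these loops on the GPU. Safe softmax subtracts the maximum before exponentiating; online softmax gets the same result while tracking the maximum and the normalizer in a single pass.

```python
import math

def safe_softmax(xs):
    # Subtract the maximum before exponentiating so exp() cannot overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def online_softmax(xs):
    # One pass over the input: keep a running maximum m and a running
    # normalizer d = sum(exp(x_i - m)) over the elements seen so far.
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Second loop only produces the outputs; the statistics are already final.
    return [math.exp(x - m) / d for x in xs]
```

Both functions should agree on finite inputs such as `[1.0, 2.0, 3.0, 1000.0]`, where a naive `exp(x) / sum(exp(x))` would overflow. The rescaling step is the same idea flash attention uses to merge softmax statistics across blocks.
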
kv cache optimization
- sparse
- quantization
- allocator
- window
- share
norm
- Batch Norm
- Layer Norm
- RMS Norm
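
To make the difference between Layer Norm and RMS Norm concrete, here is a small pure-Python sketch over a single feature vector. It is an assumption-laden illustration rather than code from this repository: the names `layer_norm` and `rms_norm` are hypothetical, the default `eps` is an arbitrary choice, and the learnable scale and bias parameters are omitted for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    # Center by the mean, then divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    # No centering: divide by the root mean square of the features only.
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]
```

RMS Norm skips the mean subtraction, so it needs one reduction (the sum of squares) instead of two, which is one reason it is common in recent LLMs.
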
position embedding
- RoPE
- ALiBi
- 2D RoPE
- 3D RoPE
- NTK-Aware RoPE
- YaRN
quantization
- smooth quant
- AWQ
- KIVI
- GPTQ
design
- chunked prefill
- continuous batching
- speculative decoding
  - Medusa
  - Lookahead decoding
  - NGram
  - OSD
  - Eagle 1,2,3
- sliding window
- multi-token prediction (MTP)
reinforcement learning
- PPO
- GRPO
- DAPO
- GPG
gemm
- deep gemm
- cutlass
  - cooperative and ping-pong gemm scheduler
- cublas
open source
- flash mla
ptx instructions
- mbarrier
- cp.async
- ldmatrix
- mma
- wgmma
